Disaggregated Prefill (PD)

This example demonstrates how to run disaggregated prefill with vLLM and LMCache using Maru as the shared KV cache backend.

Overview

Disaggregated prefill separates the compute-intensive prefill phase from the memory-intensive decode phase across different GPU instances. The prefiller generates the KV cache and stores it in CXL shared memory via Maru; the decoder then reads it from the same shared memory. Because both instances access the same CXL memory directly, there is no network transfer overhead between prefiller and decoder.

Prerequisites

  • At least 2 GPUs

  • LMCache >= v0.3.14 installed (pip install lmcache)

  • vLLM installed

  • Maru installed (see Installation)

Configuration

Prefiller / Decoder config

Both prefiller and decoder use the same configuration:

enable_pd: False
chunk_size: 256
remote_url: "maru://localhost:${MARU_SERVER_PORT}"
remote_serde: "naive"
remote_storage_plugins: ["maru"]
local_cpu: False
max_local_cpu_size: 100
save_unfull_chunk: True

extra_config:
  remote_storage_plugin.maru.module_path: maru_lmcache.adapter
  remote_storage_plugin.maru.class_name: MaruConnectorAdapter
  maru_pool_size: "4G"
  save_chunk_meta: False
  lookup_backoff_time: 0.001

Maru is loaded as an LMCache remote storage plugin. For details on each configuration field, see LMCache Integration.
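
Note that remote_url references ${MARU_SERVER_PORT}, which must be set in the environment before LMCache reads the config file. A minimal sketch of the wiring the launcher script does for you (the port value and file name here are illustrative, not the script's actual values):

```shell
# Illustrative port -- the launcher script sets the real one.
export MARU_SERVER_PORT=8100

# LMCache reads its settings from the file named by LMCACHE_CONFIG_FILE,
# so both prefiller and decoder can point at the same shared config.
export LMCACHE_CONFIG_FILE=./lmcache-config.yaml

# The URL LMCache will use to reach MaruServer after substitution:
echo "maru://localhost:${MARU_SERVER_PORT}"
```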

How to Run

(Optional) Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

1. Launch the PD setup

The launcher script starts MaruServer, prefiller, decoder, and proxy automatically:

cd examples/lmcache/disagg_prefill/1p1d
./disagg_example_1p1d.sh

Wait until you see:

All servers are up. You can send request now...

2. Try a simple query

Open a new terminal and send a single prompt through the proxy:

cd examples/lmcache/disagg_prefill/1p1d

# Send a prompt — the proxy routes it to prefiller (KV generation) then decoder (token generation)
./run_simple_query.sh
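
Under the hood, the script sends a single OpenAI-style completion request through the proxy. A hedged sketch of that request (the port, endpoint, and model name are assumptions; check the script for the actual values):

```shell
# Assumed proxy address and model -- run_simple_query.sh has the real ones.
RESP=$(curl -s http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "Hello, my name is",
       "max_tokens": 32}' || true)
# Print the response, or a hint if the proxy is not reachable.
echo "${RESP:-no response (is the proxy running?)}"
```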

You’ll see the prompt and the generated response printed in the terminal. Check decoder.log for cache hit messages such as:

LMCache INFO: [req_id=cmpl-a5a94ea4577d4025-0] Retrieved 256 out of 256 required tokens (from 256 total tokens). size: 0.0029 gb, cost 3.0579 ms, throughput: 0.9581 GB/s; (cache_engine.py:874:lmcache.v1.cache_engine)
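
To check hits across many requests, you can grep the decoder log for retrieval messages. This assumes the launcher writes the decoder's output to decoder.log in the launch directory, as in the message above:

```shell
# Count LMCache retrieval messages in the decoder log.
# Guard against the file not existing yet.
if [ -f decoder.log ]; then
  grep -c "Retrieved" decoder.log
else
  echo "decoder.log not found yet"
fi
```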

3. Run a benchmark

Once you’ve confirmed the setup works, measure throughput with a larger workload:

./run_benchmark.sh

This runs vllm bench serve with 30 random prompts against the proxy, measuring request throughput and latency under disaggregated inference.
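
To vary the workload, the same benchmark can be invoked directly. A hedged sketch using vLLM's bench CLI (the port, model, and random prompt lengths are assumptions; run_benchmark.sh has the actual values):

```shell
# Assumed proxy port and workload knobs -- see run_benchmark.sh for the
# values the example actually uses.
PROXY_PORT=9000
NUM_PROMPTS=30
if command -v vllm > /dev/null; then
  vllm bench serve \
    --backend vllm \
    --base-url "http://localhost:${PROXY_PORT}" \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random \
    --num-prompts "${NUM_PROMPTS}" \
    --random-input-len 1024 \
    --random-output-len 128 \
    || echo "benchmark failed (is the proxy up on port ${PROXY_PORT}?)"
else
  echo "vllm CLI not found"
fi
```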

Press Ctrl+C in the first terminal to stop all servers.