vLLM

Prerequisites

vLLM v0.14+ — required for KVConnectorBase_V1 support:

pip install vllm

See vLLM installation docs for GPU-specific options.

Integration Architecture

MaruKVConnector is a native vLLM KV connector that enables direct KV cache sharing between vLLM instances through CXL shared memory — without any middleware.

    flowchart LR
    subgraph prev["Previous (via LMCache)"]
        direction TB
        V1["vLLM"] --> LC["LMCacheConnector"] --> LE["LMCache Engine"]
        LE --> SM["StorageManager"] --> MC["MaruConnector"]
        MC --> MH1["MaruHandler"] --> CXL1["CXL"]
    end

    subgraph direct["Direct (this connector)"]
        direction TB
        V2["vLLM"] --> MKV["MaruKVConnector"]
        MKV --> MH2["MaruHandler"] --> CXL2["CXL"]
    end

    prev --> direct
    

By removing the LMCache middleware layer, the direct connector achieves:

  • Fewer dependencies — only vLLM + Maru

  • Zero-copy save path — GPU → CXL via single cudaMemcpy (no intermediate CPU buffer)

  • Zero-copy load path — CXL mmap (CUDA pinned) → GPU via DMA

  • No serialization overhead — raw tensor bytes, no MemoryObj conversion

Component Roles

MaruKVConnector implements vLLM’s KVConnectorBase_V1 interface with a dual-role design:

| Role | Component | Responsibility |
|------|-----------|----------------|
| SCHEDULER | MaruSchedulerConnector | Checks chunk-by-chunk which prefix is cached; builds metadata for worker |
| WORKER | MaruWorkerConnector | Performs actual GPU ↔ CXL data transfers per chunk per layer |

Both roles share the same MaruHandler connection to CXL shared memory.

Data Path

Store Path (GPU → CXL)

When a vLLM instance completes prefill, the connector stores KV cache in chunks:

    sequenceDiagram
    participant vLLM as vLLM Worker
    participant MKV as MaruKVConnector
    participant MH as MaruHandler
    participant MS as MaruServer
    participant CXL as CXL Memory

    vLLM->>MKV: save_kv_layer(layer, kv_tensor, attn_metadata)
    loop For each chunk (256 tokens)
        MKV->>MH: alloc(nbytes)
        MH-->>MKV: handle (CXL page)
        MKV->>CXL: dst.copy_(gpu_tensor) — single cudaMemcpy
        MKV->>MH: store(key, handle=handle) — register only, no memcpy
        MH->>MS: register_kv(key, region_id, offset, length)
    end
    

The save path uses handler.alloc() to get a pre-mapped CXL buffer, then copies GPU tensor data directly into it via torch.Tensor.copy_(). The subsequent store(handle=) call only registers the key in the metadata server — no additional data copy occurs.
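The allocate, copy, register sequence can be modelled without a GPU or a CXL device. The sketch below is a toy under stated assumptions: `ToyHandler`, `ToyHandle`, and `save_chunk` are hypothetical names, a `bytearray` stands in for the CXL region, and a slice assignment stands in for the cudaMemcpy. It is not the MaruHandler API; it only illustrates why registering a key after an in-place write needs no second copy.

```python
from dataclasses import dataclass


@dataclass
class ToyHandle:
    """Hypothetical handle: where the reserved bytes live in the pool."""
    offset: int
    length: int
    view: memoryview  # writable window into the pool, no copy involved


class ToyHandler:
    """In-memory stand-in for a shared-memory handler (illustration only)."""

    def __init__(self, pool_bytes):
        self._pool = bytearray(pool_bytes)  # plays the role of the CXL region
        self._cursor = 0                    # simple bump allocator
        self._index = {}                    # key -> (offset, length)

    def alloc(self, nbytes):
        """Reserve nbytes and return a handle with a writable view."""
        if self._cursor + nbytes > len(self._pool):
            raise MemoryError("toy pool exhausted")
        offset = self._cursor
        self._cursor += nbytes
        view = memoryview(self._pool)[offset:offset + nbytes]
        return ToyHandle(offset, nbytes, view)

    def store(self, key, handle):
        """Register the key only; the payload is already in place."""
        self._index[key] = (handle.offset, handle.length)

    def exists(self, key):
        return key in self._index

    def retrieve(self, key):
        """Return a view of the stored bytes without copying them."""
        offset, length = self._index[key]
        return memoryview(self._pool)[offset:offset + length]


def save_chunk(handler, key, payload):
    """Allocate, write once into the allocated region, then register."""
    handle = handler.alloc(len(payload))
    handle.view[:] = payload           # the only data copy in the save path
    handler.store(key, handle=handle)  # metadata only
```

In this toy, `retrieve()` hands back a view over the same buffer that `alloc()` exposed, which mirrors the idea that the data is written once and only its location is recorded afterwards.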

Load Path (CXL → GPU)

When a second instance receives a request with a matching prefix:

    sequenceDiagram
    participant Sched as Scheduler
    participant MKV as MaruKVConnector
    participant MH as MaruHandler
    participant MS as MaruServer
    participant CXL as CXL Memory
    participant GPU as GPU Memory

    Sched->>MKV: get_num_new_matched_tokens()
    MKV->>MH: exists(chunk_key) per chunk
    MH->>MS: lookup_kv(key)
    MH-->>MKV: hit count (e.g., 3 of 4 chunks)

    Note over MKV: Worker phase
    MKV->>MH: retrieve(key) per chunk per layer
    MH-->>MKV: MemoryInfo (CXL mmap memoryview)
    MKV->>MKV: torch.frombuffer(info.view, dtype)
    MKV->>GPU: .to(device) — CXL→GPU DMA (pinned via cudaHostRegister)
    MKV->>MKV: Inject into KV cache layer via slot mapping
    

The CXL mmap region is pinned via cudaHostRegister by MaruHandler’s DaxMapper, so .to(device) triggers a direct DMA transfer from CXL to GPU memory without any intermediate CPU copy.

Chunk-Based Storage

Tokens are divided into fixed-size chunks (default 256 tokens) for storage:

Prompt: [tok0..tok255 | tok256..tok511 | tok512..tok767 | tok768..tok900]
         chunk 0        chunk 1          chunk 2          (incomplete, not stored)

Each chunk key = kv_{hash(tok0..end)}_L{layer} — a rolling prefix hash that encodes the full context up to that chunk, enabling partial prefix reuse.
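As a rough illustration of the key scheme, the sketch below derives one key per complete chunk from a running hash over all tokens seen so far. The function name `chunk_keys`, the use of SHA-256, the 16-character truncation, and the 4-byte token encoding are assumptions made for the example; the connector's actual hash function is not specified here.

```python
import hashlib


def chunk_keys(token_ids, layer, chunk_tokens=256):
    """Return one key per complete chunk; each key covers the whole prefix."""
    keys = []
    hasher = hashlib.sha256()  # assumed hash; state accumulates across chunks
    full_chunks = len(token_ids) // chunk_tokens  # trailing partial chunk is skipped
    for i in range(full_chunks):
        chunk = token_ids[i * chunk_tokens:(i + 1) * chunk_tokens]
        # Feed only the new chunk; the hasher already holds the earlier prefix.
        hasher.update(b"".join(t.to_bytes(4, "little") for t in chunk))
        keys.append(f"kv_{hasher.hexdigest()[:16]}_L{layer}")
    return keys
```

Because each key depends on every token up to the end of its chunk, two prompts share a key only if they agree on the entire prefix up to that point. For the 901-token prompt above, this yields three keys per layer.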

Partial Prefix Reuse

Instance A:
  Request: "The quick brown fox jumps over the lazy dog. Once upon a time..."
  → Stores chunk 0, 1, 2

Instance B:
  Request: "The quick brown fox jumps over the lazy dog. In a galaxy far away..."
  → chunk 0, 1 hit (common prefix), chunk 2 miss
  → Loads chunk 0, 1 from CXL, computes the rest
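The hit counting in this scenario can be sketched as a walk over prefix keys that stops at the first miss. Everything here is illustrative: `prefix_keys` and `count_cached_tokens` are hypothetical helpers, a Python `set` stands in for the metadata server, and a tiny chunk size of 4 tokens keeps the example readable.

```python
import hashlib


def prefix_keys(token_ids, chunk_tokens):
    """One key per complete chunk, each hashing the full prefix so far."""
    keys = []
    hasher = hashlib.sha256()  # assumed hash function for the sketch
    for i in range(len(token_ids) // chunk_tokens):
        chunk = token_ids[i * chunk_tokens:(i + 1) * chunk_tokens]
        hasher.update(b"".join(t.to_bytes(4, "little") for t in chunk))
        keys.append(hasher.hexdigest())
    return keys


def count_cached_tokens(token_ids, stored_keys, chunk_tokens=256):
    """Count tokens covered by consecutive cached chunks from the start."""
    hits = 0
    for key in prefix_keys(token_ids, chunk_tokens):
        if key not in stored_keys:
            break  # later chunks depend on this one, so stop at the first miss
        hits += 1
    return hits * chunk_tokens


# Instance A stores three chunks; Instance B shares only the first two.
shared = list(range(8))
request_a = shared + [100, 101, 102, 103]
request_b = shared + [200, 201, 202, 203, 204]
stored = set(prefix_keys(request_a, chunk_tokens=4))
cached = count_cached_tokens(request_b, stored, chunk_tokens=4)
```

With these inputs, Instance B finds its first two chunks in the store and has to compute everything from the third chunk onward.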

Setup

Start Maru server:

maru-server
# Listens on tcp://0.0.0.0:5555 by default

Launch vLLM with MaruKVConnector (dynamic loading):

vllm serve <model> \
    --kv-transfer-config '{
        "kv_connector": "MaruKVConnector",
        "kv_connector_module_path": "maru_vllm",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "maru_server_url": "tcp://localhost:5555",
            "maru_pool_size": "4G"
        }
    }'

The kv_connector_module_path tells vLLM to dynamically import MaruKVConnector from the maru_vllm package. No vLLM source code changes are required.

Second instance (same node):

vllm serve <model> \
    --port 8001 \
    --kv-transfer-config '{
        "kv_connector": "MaruKVConnector",
        "kv_connector_module_path": "maru_vllm",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "maru_server_url": "tcp://localhost:5555",
            "maru_pool_size": "4G"
        }
    }'
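When launching several instances from a script, it can be less error-prone to generate the JSON than to hand-quote it in a shell. The following is a minimal sketch: the helper names `kv_transfer_config` and `serve_command` are hypothetical, while the keys and values are the ones shown in the commands above.

```python
import json
import shlex


def kv_transfer_config(server_url="tcp://localhost:5555", pool_size="4G"):
    """Build the --kv-transfer-config payload used in the commands above."""
    return {
        "kv_connector": "MaruKVConnector",
        "kv_connector_module_path": "maru_vllm",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "maru_server_url": server_url,
            "maru_pool_size": pool_size,
        },
    }


def serve_command(model, port=None):
    """Return the argv list for one vLLM instance."""
    argv = ["vllm", "serve", model]
    if port is not None:
        argv += ["--port", str(port)]
    argv += ["--kv-transfer-config", json.dumps(kv_transfer_config())]
    return argv


if __name__ == "__main__":
    # Print shell-safe command lines for two instances on the same node.
    print(shlex.join(serve_command("<model>")))
    print(shlex.join(serve_command("<model>", port=8001)))
```

The argv list can be passed directly to `subprocess.run`, which avoids shell quoting altogether; `shlex.join` is only used here to print a copy-pasteable command.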

Configuration

Settings in kv_connector_extra_config:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| maru_server_url | str | tcp://localhost:5555 | MaruServer address |
| maru_pool_size | str/int | 1G | CXL memory pool size (4G, 500M, etc.) |
| maru_chunk_size | str/int | 4M | Maru page size (CXL allocation unit) |
| maru_instance_id | str | auto | Unique instance ID (default: auto-generated UUID) |
| maru_eager_map | bool | true | Pre-map other instances’ CXL regions on connect |
| maru_kv_chunk_tokens | int | 256 | KV cache chunk granularity (in tokens) |
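The size-valued settings accept either an integer or a string such as 4G or 500M. The sketch below shows one plausible way to interpret such values. It assumes binary multiples (1G = 2^30 bytes) and a single-letter suffix; `parse_size` is a hypothetical helper, and Maru's own parsing rules may differ.

```python
def parse_size(value):
    """Convert 4G / 500M / 1024 into a byte count (binary units assumed)."""
    if isinstance(value, int):
        return value
    text = value.strip().upper()
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # plain number of bytes given as a string
```

This is mainly useful for sanity-checking a configuration before launch, for example to confirm that maru_pool_size is a whole multiple of maru_chunk_size.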

maru_kv_chunk_tokens

Controls how many tokens per chunk when storing KV cache:

  • Smaller (64, 128): Finer prefix reuse granularity, more maru keys

  • Larger (512, 1024): Fewer keys, but coarser reuse granularity

  • Default 256: Good balance for most use cases

  • Auto-aligned: Automatically adjusted to a multiple of vLLM’s block_size (for example, 300 becomes 288 when block_size is 16)
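The alignment rule can be sketched in a few lines. The example assumes the value is rounded down to the nearest multiple of block_size, which matches the 300 to 288 adjustment reported in the alignment warning; the lower bound of one block and the name `align_chunk_tokens` are assumptions added for the sketch.

```python
def align_chunk_tokens(chunk_tokens, block_size):
    """Round chunk_tokens down to a multiple of block_size (at least one block)."""
    aligned = (chunk_tokens // block_size) * block_size
    return max(aligned, block_size)  # assumed floor: never below one block
```

Values that are already multiples of block_size, such as the default 256 with a block size of 16, pass through unchanged.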

maru_pool_size

CXL memory allocated per instance. Capacity estimation:

pool_size ≈ num_layers × kv_head_dim × num_kv_heads × 2(K+V) × max_cached_tokens × dtype_bytes

Example (Llama 7B, fp16):

32 layers × 128 head_dim × 32 heads × 2(K+V) × 4096 tokens × 2 bytes = 2 GiB (2,147,483,648 bytes)
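The estimate is easy to reproduce for other models. The helper below simply evaluates the formula above; the function and parameter names are illustrative, and the result covers raw KV bytes only, without any allocator or page-rounding overhead.

```python
def estimate_pool_bytes(num_layers, num_kv_heads, head_dim,
                        max_cached_tokens, dtype_bytes):
    """Raw KV cache bytes: layers x head_dim x kv_heads x 2 (K+V) x tokens x dtype."""
    per_token = num_layers * head_dim * num_kv_heads * 2 * dtype_bytes
    return per_token * max_cached_tokens


# The Llama 7B, fp16 example from above.
llama_7b_bytes = estimate_pool_bytes(
    num_layers=32, num_kv_heads=32, head_dim=128,
    max_cached_tokens=4096, dtype_bytes=2,
)
```

For this configuration each cached token costs 512 KiB across all layers, so 4096 tokens come to exactly 2 GiB. Models with grouped-query attention have fewer KV heads and need proportionally less.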

Comparison with LMCache Path

| Aspect | Via LMCache | Direct (this connector) |
|--------|-------------|-------------------------|
| Dependencies | vLLM + LMCache + maru | vLLM + maru |
| Middleware | LMCache Engine, StorageManager, RemoteBackend | None |
| Serialization | LMCache MemoryObj conversion | torch tensor ↔ bytes direct |
| Prefix matching | LMCache CacheEngineKey hashing | vLLM token prefix hashing |
| Configuration | LMCACHE_CONFIG_FILE YAML | kv_connector_extra_config JSON |
| Save path | GPU → CPU → bytes → alloc + memcpy → CXL | GPU → CXL (single DMA via alloc) |
| Load path | CXL → clone → CPU → GPU | CXL → GPU (single DMA, pinned) |

Troubleshooting

MaruServer Connection Failure

ERROR: Failed to connect to MaruServer at tcp://localhost:5555

Verify maru-server is running and accessible.
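A quick way to check reachability from the machine running vLLM is a plain TCP connection attempt. The sketch below is a generic check with a hypothetical helper name, `can_reach`. It only confirms that something accepts connections on the configured host and port; it does not speak the MaruServer protocol.

```python
import socket
from urllib.parse import urlsplit


def can_reach(url, timeout=2.0):
    """Return True if a TCP connection to tcp://host:port succeeds."""
    parts = urlsplit(url)
    if parts.scheme != "tcp" or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected tcp://host:port, got {url!r}")
    try:
        with socket.create_connection((parts.hostname, parts.port),
                                      timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `can_reach("tcp://localhost:5555")` on the vLLM host returns `False` when nothing is listening on that port, which narrows the problem down to the server process or the network path.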

CXL Memory Exhausted

ERROR: Cannot allocate page for key ...

Increase maru_pool_size or restart maru-server to free memory.

chunk_tokens Alignment Warning

WARNING: maru_kv_chunk_tokens 300 not aligned to block_size 16, adjusted to 288

This is expected. The connector automatically adjusts the value to a multiple of vLLM’s block_size.

BFloat16 Store Errors

TypeError: Got unsupported ScalarType BFloat16
# or
RuntimeError: can't convert bfloat16 to numpy

BFloat16 models (e.g., Llama 3) fail with numpy-based serialization because numpy has no bfloat16 dtype. The connector handles this by using torch.Tensor.contiguous() and raw byte views instead of .numpy(). If you see this error, ensure you are using the connector from this PR or later — older versions may use numpy conversion paths.

Garbage Output on Second Instance

KV cache data corruption usually means chunks were concatenated as 1D bytes instead of being injected per-chunk. Ensure per-chunk injection is used (the current connector handles this correctly).

For runnable examples, see vLLM Examples.