LMCache¶

Integration Architecture¶

The full stack from inference engine to shared memory:

System-level architecture

Control Plane (dashed arrows) — metadata RPC: KV registration, region claim/release.

Data Plane (solid arrows) — mmap direct CXL read/write, zero-copy.

Component Architecture¶

Component-level architecture

Layer responsibilities:

Layer	Responsibility	Scope
LMCache stack	Inference engine → CacheEngine → StorageManager → MaruBackend	LMCache (external)
MaruBackend	LMCache `AllocatorBackendInterface` — allocates directly on CXL, async store, sync get	Integration boundary
CxlMemoryAdapter	LMCache `MemoryAllocatorInterface` — translates Maru pages to `TensorMemoryObj` pool	Integration boundary
MaruHandler	Client-side KV operations, memory mapping, connection management	Maru client
MaruServer	Central metadata store, memory allocation coordinator	Maru server

The integration boundary sits at MaruBackend + CxlMemoryAdapter. Everything above is LMCache; everything below is Maru. These two classes are the only components that import from both projects.

Backend Design¶

Two-layer integration¶

MaruBackend (AllocatorBackendInterface)
  ├── CxlMemoryAdapter (MemoryAllocatorInterface)
  │     ├── _pool: {region_id: [TensorMemoryObj per page]}
  │     └── address encoding: (rid << 32) | pid
  └── MaruHandler (Maru client)
        ├── RpcClient → MaruServer
        ├── DaxMapper (mmap management)
        └── OwnedRegionManager (page allocation)

MaruHandler manages CXL memory (regions, pages, mmap). CxlMemoryAdapter translates pages into LMCache’s TensorMemoryObj format.

Data Path¶

Store Path (write)¶

When the inference engine produces new KV cache data:

        sequenceDiagram
    participant IE as Inference Engine
    participant CE as CacheEngine
    participant MB as MaruBackend
    participant MH as MaruHandler
    participant MS as MaruServer
    participant CXL as CXL Memory

    IE->>CE: KV tensors (GPU)
    CE->>MB: allocate(size)
    MB->>MH: alloc(size)
    MH-->>MB: handle (page in CXL region)
    MB-->>CE: MemoryObj (CXL-backed)
    CE->>CXL: GPU → CXL direct copy (only data copy)
    CE->>MB: put(key, MemoryObj)
    MB->>MH: store(key, handle)
    MH->>MS: register_kv(key, region_id, offset, length)
    MS-->>MH: success
    MH-->>MB: True
    MB-->>CE: done

Retrieve Path (read)¶

When the inference engine needs cached KV data:

        sequenceDiagram
    participant IE as Inference Engine
    participant CE as CacheEngine
    participant MB as MaruBackend
    participant MH as MaruHandler
    participant MS as MaruServer
    participant CXL as CXL Memory

    IE->>CE: Request KV for prompt prefix
    CE->>MB: get(key)
    MB->>MH: retrieve(key)
    MH->>MS: lookup_kv(key)
    MS-->>MH: region_id, offset, length
    MH->>CXL: Map shared region (if not already mapped)
    MH-->>MB: MemoryInfo (zero-copy memoryview)
    MB-->>CE: MemoryObj (points to CXL mmap, zero-copy)
    CE-->>IE: KV tensors

The key design point is that data never travels over the network. Only metadata (region ID, offset, length) is exchanged via RPC. The actual KV tensor data is accessed directly from CXL shared memory through memory-mapped regions.

Configuration¶

Maru is configured as a native LMCache storage backend via the maru_path and maru_pool_size config fields. No plugin registration is needed.

chunk_size: 256
local_cpu: False
max_local_cpu_size: 0
save_unfull_chunk: True

# Maru backend
maru_path: "maru://localhost:5555"
maru_pool_size: 4

extra_config:
  lookup_backoff_time: 0.001
  # maru_instance_id: "my-id"       # Unique client ID (default: auto UUID)
  # maru_timeout_ms: 5000           # ZMQ socket timeout (ms)
  # maru_use_async_rpc: true        # Async DEALER-ROUTER RPC
  # maru_max_inflight: 64           # Max in-flight async requests
  # maru_eager_map: true            # Pre-map shared regions on connect

MaruBackend settings¶

Field	Default	Description
`maru_path`	(required)	MaruServer address. Format: `maru://<host>:<port>`
`maru_pool_size`	`4`	CXL memory pool size in GB

Maru extra_config parameters¶

Parameter	Default	Description
`maru_instance_id`	auto-generated UUID	Unique client instance identifier
`maru_timeout_ms`	`5000`	ZMQ socket timeout in milliseconds for RPC communication
`maru_use_async_rpc`	`true`	Use async DEALER-ROUTER pattern for higher throughput
`maru_max_inflight`	`64`	Max concurrent in-flight async RPC requests
`maru_eager_map`	`true`	Pre-map all shared regions on connect

For runnable examples, see LMCache Examples.

See also: Architecture Overview, MaruHandler Design, Python API Reference, Configuration Reference