LMCache Integration¶
Integration Architecture¶
The full stack from inference engine to shared memory:

- **Control Plane** (dashed arrows): metadata RPC for KV registration and region claim/release.
- **Data Plane** (solid arrows): direct `mmap` reads and writes to CXL memory, zero-copy.
Component Architecture¶

Layer responsibilities:
| Layer | Responsibility | Scope |
|---|---|---|
| LMCache stack | Inference engine → CacheEngine → StorageManager → RemoteBackend | LMCache (external) |
| MaruConnector | Adapts LMCache’s RemoteConnector to MaruHandler’s API | Integration boundary |
| MaruHandler | Client-side KV operations, memory mapping, connection management | Maru client |
| MaruServer | Central metadata store, memory allocation coordinator | Maru server |
The integration boundary sits at MaruConnector. Everything above is LMCache; everything below is Maru. MaruConnector is the only component that imports from both projects.
Connector Design¶
LMCache defines a RemoteConnector interface that all remote storage backends
must implement (exists, get, put, close, and batch variants). MaruConnector
implements this interface by delegating to MaruHandler.
Why the connector pattern: LMCache’s RemoteBackend is designed for pluggable storage. The same StorageManager can use Redis, S3, Mooncake, or Maru without any change to the cache engine logic. MaruConnector slots in as one such plugin.
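The shape of that pluggable contract can be sketched as an abstract base class. This is an illustration based on the method names listed above, not LMCache's actual definition; the real interface also includes batch and async variants with their own signatures.

```python
from abc import ABC, abstractmethod

# Sketch of the RemoteConnector contract described in the text.
# Method names come from the list above; real LMCache signatures differ.
class RemoteConnector(ABC):
    @abstractmethod
    def exists(self, key) -> bool:
        """Return True if the key is present in remote storage."""

    @abstractmethod
    def get(self, key):
        """Fetch the value for key, or None if absent."""

    @abstractmethod
    def put(self, key, memory_obj) -> bool:
        """Store a value; return True on success."""

    @abstractmethod
    def close(self) -> None:
        """Release connections and mapped resources."""
```

Any backend (Redis, S3, Mooncake, Maru) that implements these methods can be dropped into the StorageManager unchanged.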
The key translation between the two APIs involves:
- **Key conversion**: LMCache uses structured `CacheEngineKey` objects; MaruHandler uses string keys (`CacheEngineKey.to_string()`).
- **Zero-copy bridging**: MaruHandler returns `MemoryInfo` (a memoryview wrapper), which the connector wraps as LMCache’s `MemoryObj` without copying data.
- **Batch optimization**: the connector maps LMCache’s batch operations to MaruHandler’s batch RPC calls, reducing round-trip overhead.
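The delegation pattern can be sketched with an in-memory stand-in for the handler. `StubMaruHandler`, its `contains`/`retrieve`/`store` methods, and the simplified `MemoryInfo` below are hypothetical; the real MaruHandler speaks RPC to MaruServer and returns views over mapped CXL regions.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryInfo:
    """Zero-copy view into stored data (simplified stand-in)."""
    view: memoryview


@dataclass
class StubMaruHandler:
    """In-memory stand-in for the RPC-backed MaruHandler."""
    store_map: dict = field(default_factory=dict)

    def contains(self, key: str) -> bool:
        return key in self.store_map

    def retrieve(self, key: str):
        buf = self.store_map.get(key)
        return MemoryInfo(memoryview(buf)) if buf is not None else None

    def store(self, key: str, data: bytes) -> bool:
        self.store_map[key] = bytearray(data)
        return True


class MaruConnector:
    """Adapts the handler API to the RemoteConnector shape.

    Structured keys are flattened to strings (CacheEngineKey.to_string()),
    and retrieved memoryviews are passed through without copying.
    """

    def __init__(self, handler: StubMaruHandler):
        self.handler = handler

    def exists(self, key) -> bool:
        return self.handler.contains(str(key))

    def get(self, key):
        info = self.handler.retrieve(str(key))
        # Hand the memoryview straight through -- no byte copy.
        return None if info is None else info.view

    def put(self, key, obj: bytes) -> bool:
        return self.handler.store(str(key), obj)
```

In the real integration, `get` wraps the view as an LMCache `MemoryObj` rather than returning a bare memoryview.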
Data Path¶
Store Path (write)¶
When the inference engine produces new KV cache data:
```mermaid
sequenceDiagram
    participant IE as Inference Engine
    participant CE as CacheEngine
    participant MC as MaruConnector
    participant MH as MaruHandler
    participant MS as MaruServer
    participant CXL as CXL Memory
    IE->>CE: KV tensors (GPU)
    CE->>MC: put(key, MemoryObj)
    MC->>MH: alloc(size)
    MH-->>MC: handle (page in CXL region)
    MC->>CXL: write data via handle buffer (zero-copy)
    MC->>MH: store(key, handle)
    MH->>MS: register_kv(key, region_id, offset, length)
    MS-->>MH: success
    MH-->>MC: True
    MC-->>CE: done
```
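The store path above can be sketched with an anonymous `mmap` standing in for a CXL shared region. The bump allocator, the `metadata` dict, and the `"region-0"` identifier are illustrative stand-ins; in the real system MaruServer assigns offsets and records `register_kv` metadata over RPC.

```python
import mmap

REGION_SIZE = 1 << 20
region = mmap.mmap(-1, REGION_SIZE)  # stand-in for a mapped CXL region
metadata = {}                        # stand-in for MaruServer's KV registry
next_offset = 0


def alloc(size: int) -> int:
    """Return an offset for `size` bytes (trivial bump allocator)."""
    global next_offset
    off = next_offset
    next_offset += size
    return off


def store(key: str, data: bytes) -> None:
    off = alloc(len(data))
    # Data plane: write straight into the mapped region.
    region[off:off + len(data)] = data
    # Control plane: only metadata is registered, never the data itself.
    metadata[key] = ("region-0", off, len(data))


store("prefix-hash-0", b"kv-page")
```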
Retrieve Path (read)¶
When the inference engine needs cached KV data:
```mermaid
sequenceDiagram
    participant IE as Inference Engine
    participant CE as CacheEngine
    participant MC as MaruConnector
    participant MH as MaruHandler
    participant MS as MaruServer
    participant CXL as CXL Memory
    IE->>CE: Request KV for prompt prefix
    CE->>MC: get(key)
    MC->>MH: retrieve(key)
    MH->>MS: lookup_kv(key)
    MS-->>MH: region_id, offset, length
    MH->>CXL: Map shared region (if not already mapped)
    MH-->>MC: MemoryInfo (zero-copy memoryview)
    MC->>MC: Wrap as MemoryObj (zero-copy)
    MC-->>CE: MemoryObj
    CE-->>IE: KV tensors
```
The key design point is that data never travels over the network. Only metadata (region ID, offset, length) is exchanged via RPC. The actual KV tensor data is accessed directly from CXL shared memory through memory-mapped regions.
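The metadata-only exchange can be illustrated with the same `mmap` stand-in: the lookup returns only `(region_id, offset, length)`, and the reader slices a memoryview over the already-mapped region, so no bytes are copied or transferred. The `metadata` dict here plays the role of MaruServer's `lookup_kv` response.

```python
import mmap

region = mmap.mmap(-1, 1 << 20)          # stand-in for the mapped CXL region
region[128:134] = b"kvdata"              # data written earlier by a producer
metadata = {"prefix-hash-0": ("region-0", 128, 6)}  # as if from lookup_kv


def retrieve(key: str):
    entry = metadata.get(key)            # metadata RPC stand-in
    if entry is None:
        return None
    _region_id, offset, length = entry
    # Slicing a memoryview over the mmap copies nothing: the caller
    # reads the shared region directly through the view.
    return memoryview(region)[offset:offset + length]


view = retrieve("prefix-hash-0")
```

Because the view aliases the shared region, a write by another client is immediately visible through it.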
Configuration¶
Maru is loaded as an LMCache remote storage plugin (requires LMCache >= v0.3.14). Configuration is done via the LMCache YAML config file.
```yaml
chunk_size: 256
local_cpu: True
max_local_cpu_size: 5
enable_async_loading: True

# Disable P2P for Maru shared storage mode
enable_p2p: False
enable_controller: False

# Maru backend - format: maru://<host>:<port>
remote_url: "maru://localhost:5555"
remote_serde: "naive"

remote_storage_plugins: ["maru"]
extra_config:
  remote_storage_plugin.maru.module_path: maru_lmcache.adapter
  remote_storage_plugin.maru.class_name: MaruConnectorAdapter
  maru_pool_size: "4G"            # CXL memory pool size ("1G", "500M", etc.)
  save_chunk_meta: False
  lookup_backoff_time: 0.001
  # maru_instance_id: "my-id"     # Unique client ID (default: auto UUID)
  # maru_operation_timeout: 10.0  # Per-operation timeout in seconds
  # maru_timeout_ms: 2000         # ZMQ socket timeout (ms)
  # maru_use_async_rpc: true      # Async DEALER-ROUTER RPC
  # maru_max_inflight: 64         # Max in-flight async requests
```
Plugin settings¶
| Field | Description |
|---|---|
| `remote_storage_plugins: ["maru"]` | Registers Maru as a plugin backend |
| `remote_storage_plugin.maru.module_path` | Python module containing the adapter class |
| `remote_storage_plugin.maru.class_name` | Adapter class name (`MaruConnectorAdapter`) |
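This `module_path`/`class_name` pair follows the standard dynamic-import pattern for plugin loading. The sketch below shows the general mechanism using stdlib names so it runs anywhere; LMCache's actual plugin loader may differ in details.

```python
import importlib

# Generic plugin resolution: a module path plus a class name resolve to
# a class at runtime. Stdlib stand-ins are used for illustration; in the
# real config these would be maru_lmcache.adapter / MaruConnectorAdapter.
settings = {
    "module_path": "collections",
    "class_name": "OrderedDict",
}

module = importlib.import_module(settings["module_path"])
plugin_cls = getattr(module, settings["class_name"])
instance = plugin_cls()
```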
Maru extra_config parameters¶
| Parameter | Default | Description |
|---|---|---|
| `maru_pool_size` | | CXL memory pool size. Supports human-readable strings (`"1G"`, `"500M"`, etc.) |
| `maru_instance_id` | auto-generated UUID | Unique client instance identifier |
| `maru_operation_timeout` | | Timeout in seconds for individual KV operations |
| `maru_timeout_ms` | | ZMQ socket timeout in milliseconds for RPC communication |
| `maru_use_async_rpc` | | Use async DEALER-ROUTER pattern for higher throughput |
| `maru_max_inflight` | | Max concurrent in-flight async RPC requests |
| | (from `remote_url`) | Override server URL. Normally not needed |
| | | Auto-connect to MaruServer on initialization |
| | | Pre-map all shared regions on connect |
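For reference, a helper like the following could parse the human-readable size strings accepted by `maru_pool_size`. This `parse_size` function is a hypothetical illustration, not Maru's actual parser.

```python
# Hypothetical parser for human-readable sizes such as "4G" or "500M",
# as accepted by maru_pool_size; Maru's real implementation may differ.
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(spec: str) -> int:
    """Convert a size string like "4G", "500M", or "1024" to bytes."""
    spec = spec.strip().upper().removesuffix("B")
    if spec and spec[-1] in _UNITS:
        return int(float(spec[:-1]) * _UNITS[spec[-1]])
    return int(spec)  # plain integer byte count
```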
For runnable examples, see LMCache Examples.