# SGLang HiCache

## Overview
Maru integrates with SGLang's Hierarchical Cache (HiCache) as an L3 storage backend. HiCache organizes the KV cache in three tiers:

| Tier | Storage | Managed by |
|---|---|---|
| L1 | GPU VRAM | SGLang |
| L2 | Host DRAM | SGLang |
| L3 | CXL Shared Memory | Maru |
When GPU memory pressure triggers eviction, KV pages flow down: L1 → L2 → L3. On cache hit, pages flow back up. Because Maru’s L3 is shared memory, all SGLang instances on the same node can read each other’s KV cache — enabling P2P sharing without network transfer.
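The tier flow above can be sketched as a toy model (not SGLang's actual implementation): pages evicted from a full upper tier cascade downward, and a hit in a lower tier promotes the page back to L1.

```python
# Conceptual sketch of the L1 -> L2 -> L3 hierarchy; capacities and the
# LRU policy here are illustrative assumptions, not SGLang's real logic.
from collections import OrderedDict

class TieredKVCache:
    """Toy model of the L1 (VRAM) / L2 (DRAM) / L3 (shared memory) tiers."""

    def __init__(self, l1_capacity: int, l2_capacity: int):
        self.l1 = OrderedDict()   # GPU VRAM tier (smallest)
        self.l2 = OrderedDict()   # Host DRAM tier
        self.l3 = {}              # Shared-memory tier (unbounded in this toy)
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity

    def _evict(self, tier: OrderedDict, capacity: int, lower) -> None:
        # Push least-recently-used pages down one tier when over capacity.
        while len(tier) > capacity:
            key, page = tier.popitem(last=False)
            lower[key] = page

    def put(self, key, page) -> None:
        self.l1[key] = page
        self._evict(self.l1, self.l1_capacity, self.l2)
        self._evict(self.l2, self.l2_capacity, self.l3)

    def get(self, key):
        for tier in (self.l1, self.l2, self.l3):
            if key in tier:
                page = tier.pop(key)
                self.put(key, page)   # promote back to L1 on hit
                return page
        return None
```

Because L3 survives eviction from both upper tiers, a second instance sharing the same L3 pool can hit on pages the first instance stored — the basis of the P2P sharing described later.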
## Prerequisites
SGLang — install from `main` (required until the write-through idle fix 079a1fd is included in a release):

```bash
pip install "sglang[all] @ git+https://github.com/sgl-project/sglang#subdirectory=python"
```
See SGLang installation docs for GPU-specific options.
Why main? The write-through idle fix (#20560) is required for L3 store to work — without it, write-through events are not processed when the scheduler is idle, so KV pages never reach L3 storage.
## Configuration
SGLang loads Maru via `--hicache-storage-backend dynamic`. All Maru settings are passed through `--hicache-storage-backend-extra-config` as a JSON string with `maru_`-prefixed keys.
```bash
sglang serve \
  --model-path Qwen/Qwen2.5-7B \
  --enable-hierarchical-cache \
  --hicache-ratio 2.0 \
  --hicache-write-policy write_through \
  --hicache-mem-layout page_first_direct \
  --hicache-storage-backend dynamic \
  --hicache-storage-backend-extra-config '{
    "backend_name": "maru",
    "module_path": "maru_sglang.maru_storage",
    "class_name": "MaruStorage",
    "interface_v1": 1,
    "maru_server_url": "tcp://localhost:5555",
    "maru_pool_size": "4G"
  }'
```
Important: `interface_v1` must be set to `1` for the V1 API path (`batch_get_v1`/`batch_set_v1`) to be used. Without it, SGLang falls back to the legacy API, which does not support the `page_first_direct` layout.
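Because the extra-config is JSON embedded in a shell string, it can be less error-prone to generate it from Python. The helper below is our own sketch, not part of SGLang or Maru:

```python
# Sketch: build the --hicache-storage-backend-extra-config JSON from a dict
# so interface_v1 and the maru_* keys are not hand-edited inside quotes.
# build_extra_config is a hypothetical helper, not a Maru/SGLang API.
import json
import shlex

def build_extra_config(server_url: str = "tcp://localhost:5555",
                       pool_size: str = "4G") -> str:
    cfg = {
        "backend_name": "maru",
        "module_path": "maru_sglang.maru_storage",
        "class_name": "MaruStorage",
        "interface_v1": 1,   # required for the batch_get_v1/batch_set_v1 path
        "maru_server_url": server_url,
        "maru_pool_size": pool_size,
    }
    return json.dumps(cfg)

# shlex.quote makes the JSON safe to splice into a shell command line.
flag = f"--hicache-storage-backend-extra-config {shlex.quote(build_extra_config())}"
```

The quoted `flag` string can then be appended to the `sglang serve` invocation shown above.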
### HiCache parameters

| Parameter | Description |
|---|---|
| `--enable-hierarchical-cache` | Enable the HiCache system |
| `--hicache-ratio 2.0` | Host cache = 2× GPU cache size |
| `--hicache-write-policy write_through` | Persist to L3 on cache hit |
| `--hicache-mem-layout page_first_direct` | Page-first memory layout (required by MaruStorage) |
| `--hicache-storage-backend dynamic` | Load the backend class dynamically |
### Maru extra-config parameters

| Parameter | Default | Description |
|---|---|---|
| `interface_v1` | | Set to `1` to select the V1 API path |
| `maru_server_url` | | MaruServer address |
| `maru_pool_size` | | CXL shared memory pool size (e.g. `4G`) |
| | | Chunk size for memory allocation |
| | auto UUID | Unique client instance identifier |
| | | ZMQ socket timeout (ms) |
| | | Async DEALER-ROUTER RPC |
| | | Max concurrent in-flight async requests |
| | | Pre-map all shared regions on connect |
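For illustration, a size string such as `4G` might be normalized to bytes as sketched below; the accepted suffixes and helper name are assumptions, not Maru's documented grammar.

```python
# Hypothetical parser for pool-size strings like "4G" or "512M".
# Binary (1024-based) units are assumed here.
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

def parse_size(text: str) -> int:
    """Convert a size string ("4G", "512M", "1024") to a byte count."""
    m = re.fullmatch(r"(\d+)([KMGT]?)B?", text.strip().upper())
    if not m:
        raise ValueError(f"bad size string: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]
```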
## Integration Architecture
```mermaid
sequenceDiagram
    participant SG as SGLang HiCache
    participant MHS as MaruStorage
    participant MH as MaruHandler
    participant MS as MaruServer
    participant CXL as CXL Memory
    Note over SG,CXL: Store Path (L2 → L3)
    SG->>MHS: batch_set_v1(keys, host_indices)
    MHS->>MHS: collect host pages as memoryviews
    MHS->>MH: batch_store(keys, memoryviews)
    MH->>CXL: alloc + memcpy (per key)
    MH->>MS: batch_register_kv(keys, regions, offsets)
    MH-->>MHS: [True, ...]
    MHS-->>SG: [True, ...]
    Note over SG,CXL: Retrieve Path (L3 → L2)
    SG->>MHS: batch_get_v1(keys, host_indices)
    MHS->>MH: batch_retrieve(keys)
    MH->>MS: lookup_kv(keys)
    MS-->>MH: region_id, offset, length
    MH->>CXL: map region (if needed)
    MH-->>MHS: [MemoryInfo, ...]
    MHS->>CXL: ctypes.memmove(maru_buf → host_page)
    MHS-->>SG: [True, ...]
```
The V1 API uses batch_store / batch_retrieve for single-RPC batch
operations. For MLA models (1 chunk/key), host pages are passed directly
as memoryviews. For non-MLA models (K+V in separate buffer pools),
K and V are concatenated into a contiguous buffer before passing to
batch_store (see TODO(KV-split-pool) in maru_storage.py).
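The non-MLA concatenation step and the `ctypes.memmove` copy-back from the retrieve path can be sketched as follows; the helper names and buffer shapes are hypothetical, not the actual `maru_storage.py` code.

```python
# Sketch of the non-MLA path: K and V pages live in separate buffer pools,
# so they are packed into one contiguous buffer before batch_store and
# split back into the host pages on retrieval. pack_kv/unpack_kv are
# illustrative helpers, not Maru APIs.
import ctypes

def pack_kv(k_page: memoryview, v_page: memoryview) -> bytearray:
    """Concatenate K and V into one contiguous buffer for batch_store."""
    buf = bytearray(len(k_page) + len(v_page))
    buf[: len(k_page)] = k_page
    buf[len(k_page):] = v_page
    return buf

def unpack_kv(buf: bytearray, k_page: bytearray, v_page: bytearray) -> None:
    """Copy a retrieved contiguous buffer back into the K/V host pages."""
    src = (ctypes.c_char * len(buf)).from_buffer_copy(bytes(buf))
    # First half -> K host page, second half -> V host page.
    ctypes.memmove((ctypes.c_char * len(k_page)).from_buffer(k_page),
                   src, len(k_page))
    ctypes.memmove((ctypes.c_char * len(v_page)).from_buffer(v_page),
                   ctypes.byref(src, len(k_page)), len(v_page))
```

For MLA models there is a single chunk per key, so the host page's memoryview can be handed to `batch_store` directly without this packing step.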
## Quick Start: Single Instance

```bash
cd examples/sglang/single/

# Start MaruServer + SGLang (Ctrl-C to stop)
bash single_example.sh

# In another terminal:
bash run_simple_query.sh
bash run_benchmark.sh
```
See examples/sglang/single/README.md for details.
## P2P Sharing: Two Instances

```bash
cd examples/sglang/p2p_sharing/

# Start MaruServer + 2 SGLang instances (Ctrl-C to stop)
bash p2p_example.sh

# In another terminal:
bash run_simple_query.sh   # Send query to inst1, then inst2 (cache hit check)
bash run_benchmark.sh      # TTFT measurement
```
Success criteria:

- Instance 2 logs show `batch_get_v1 result: N/N hits` (100% cache hit)
- Benchmark reports TTFT speedup > 1.5×
See examples/sglang/p2p_sharing/README.md for details on the write-through flush mechanism.
## Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| | MaruServer not running | Start MaruServer first |
| No cache hits on Instance 2 | Write-through not flushed | Send two queries to Instance 1 first (hit-count threshold) and wait ~3s |
| | Package not installed | Install the Maru Python package |
| Low TTFT speedup (< 1.5×) | Prompt too short for meaningful prefill savings | Use longer prompts (> 500 tokens) |
See also: Architecture Overview, LMCache, Python API Reference