# Examples

Code examples demonstrating XArith library usage in device kernels.
## Quick Start

The examples below omit buffer-allocation error checking for brevity. See Buffer Allocation below for production-safe patterns.
### L2 Distance

```cpp
#include <xarith/xarith.hpp>

void computeExampleL2Distance(const float* vec1, const float* vec2,
                              float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();
    uint64_t diffBuf = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);

    ctx.sub(buf2, buf1, diffBuf);         // diffBuf = vec2 - vec1
    *result = ctx.dot(diffBuf, diffBuf);  // L2^2 = (vec2 - vec1) · (vec2 - vec1)
}
MU_KERNEL_ADD(computeExampleL2Distance);
```
### Inner Product

```cpp
void computeExampleInnerProduct(const float* vec1, const float* vec2,
                                float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension, mu::vdma::VpeIdStrategy::ByClusterId);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);

    *result = ctx.dot(buf1, buf2);
}
MU_KERNEL_ADD(computeExampleInnerProduct);
```
### Vector Addition

```cpp
void computeExampleVectorAdd(const float* vec1, const float* vec2,
                             float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();
    uint64_t bufResult = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);

    ctx.add(buf1, buf2, bufResult);  // bufResult = buf1 + buf2
    ctx.store(bufResult, result);    // write back to DRAM
}
MU_KERNEL_ADD(computeExampleVectorAdd);
```
## CMake Integration

```cmake
cmake_minimum_required(VERSION 3.11)

# Set MU toolchain paths from environment variables (set by the SDK installer)
set(MU_LIB_PATH "$ENV{MU_LIB_PATH}")
set(MU_LLVM_PATH "$ENV{MU_LLVM_PATH}")
set(CMAKE_C_COMPILER ${MU_LLVM_PATH}/bin/clang)
set(CMAKE_CXX_COMPILER ${MU_LLVM_PATH}/bin/clang)

project(my_device_kernel LANGUAGES CXX)

# Find the xarith package (installed at /opt/xarith/)
find_package(xarith REQUIRED)

# Create the device kernel executable
add_executable(my_kernel.mubin my_kernel.cpp)

# Link the xarith library
target_link_libraries(my_kernel.mubin PRIVATE xarith)

# Include directories
target_include_directories(my_kernel.mubin PRIVATE
    ${MU_LLVM_PATH}/picolibc-rv/include
    ${MU_LLVM_PATH}/libcxx-rv/include/c++/v1
    ${MU_LIB_PATH}/include
)

# Set MU-specific compile flags
target_compile_options(my_kernel.mubin PRIVATE
    -target riscv64-unknown-elf
    -march=rv64gcv_zvl256b_zve64x_zve64f_zve64d
    -mabi=lp64d
    -O2
    -fno-exceptions
    -fno-rtti
)
```
## Example Project

The xarith-examples package includes complete host + device examples:

- basic – L2 distance, inner product, and vector add/sub with a test runner
- vpe – VPE operation testing with performance measurement
- knn – VPE-accelerated KNN with L2 and dot-product distance
Extract and build:

```shell
tar -xzf xarith-examples-<version>.tar.gz
cd xarith-examples-<version>
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=/opt/xarith ..
make
```
### basic

Runs four vector operation tests (L2 distance, inner product, vector add, vector sub) on the MU device.

```shell
./basic/compute_example_host
```
Expected output:

```
XArith Basic Example
L2 Distance Test
[Host] Executing kernel...
[Host] Expected: 128 Actual: 128
[Host] Test PASSED
Inner Product Test
[Host] Executing kernel...
[Host] Expected: 256 Actual: 256
[Host] Test PASSED
Vector Add Test
[Host] Executing kernel...
[Host] Test PASSED
Vector Sub Test
[Host] Executing kernel...
[Host] Test PASSED
4/4 tests passed
```
### vpe

Tests VPE operations (dot, L2, add, sub) across different configurations and measures throughput in GFLOPS.

```shell
./vpe/compute_vpe
```
Environment Variables:

| Variable | Default | Description |
|---|---|---|
| DEVICEID | 0 | Device ID |
Sample output (abbreviated):

```
XArith VPE Example
...
Dot Product - Task Ratio x Dimension Matrix
Op    | numSub | taskCount | Dim    | Iter/Task  | Total Iters | Time(us)   | GFLOPS     | Status
------+--------+-----------+--------+------------+-------------+------------+------------+--------
dot   |     24 |        48 |    128 |    2000000 |    96000000 | ...        | ...        | PASS
dot   |     24 |        96 |    128 |    2000000 |   192000000 | ...        | ...        | PASS
dot   |     24 |       192 |    128 |    2000000 |   384000000 | ...        | ...        | PASS
dot   |     24 |       384 |    128 |    2000000 |   768000000 | ...        | ...        | PASS
...
dot   |     24 |        48 |   1024 |    2000000 |    96000000 | ...        | ...        | PASS
dot   |     24 |        96 |   1024 |    2000000 |   192000000 | ...        | ...        | PASS
dot   |     24 |       192 |   1024 |    2000000 |   384000000 | ...        | ...        | PASS
dot   |     24 |       384 |   1024 |    2000000 |   768000000 | ...        | ...        | PASS
16/16 passed
...
All benchmarks completed.
```
The example runs multiple test suites including numSub sweeps, taskCount sweeps, dimension sweeps (64–8192), iteration sweeps, task ratio sweeps for all four operations, and a cross-product dimension × task ratio matrix.
### knn

VPE-accelerated KNN search with L2 and dot-product distance.

```shell
./knn/knn_vpe_host          # run with defaults
./knn/knn_vpe_host --help   # see all options
```
Options:

| Option | Default | Description |
|---|---|---|
| --dim N | 128 | Vector dimension |
| --vectors N | 960000 | Number of data points |
| --k N | 5 | Number of nearest neighbors |
Environment Variables:

| Variable | Default | Description |
|---|---|---|
| DEVICEID | 0 | Device ID |
| NUMSUB | 24 | Number of subsystems |
| TASKCOUNT | 48 | Number of subtasks for data partitioning |
Sample output:

```
XArith KNN Example
Config: dim=128 points=960000 k=5 tasks=48 subs=24 (20000 points/task)
KNN VPE L2 Distance Test
[VPE] Warmup (1 run): ...
[VPE] Elapsed (5 runs): ...
[VPE] min=... median=... max=...
[Result] Top-5:
  [0] label=... dist=...
  [1] label=... dist=...
  ...
[Verify] PASSED
KNN VPE Dot Product Test
[VPE] Warmup (1 run): ...
[VPE] Elapsed (5 runs): ...
[VPE] min=... median=... max=...
[Result] Top-5:
  [0] label=... dist=...
  [1] label=... dist=...
  ...
[Verify] PASSED
2/2 tests passed
```
## Tips

### Buffer Allocation

Always check allocation results before using buffers:

```cpp
uint64_t buf = ctx.allocateBuffer();
if (buf == mu::vdma::VpeContext::INVALID_BUFFER) {
    return;  // allocation failed
}
```
VpeContext supports up to 32 buffer slots, where each slot holds one FP32 vector of the configured dimension. Query availability with getAvailableBufferCount() before bulk allocation.
When multiple buffers are needed, use tryAllocateBuffers() for atomic all-or-nothing allocation. This prevents deadlock when multiple tasks compete for limited SRAM slots:
```cpp
uint64_t bufs[3];
while (!ctx.tryAllocateBuffers(3, bufs)) {
    // retry until all 3 buffers are available
}
// use bufs[0], bufs[1], bufs[2] ...

ctx.freeBuffers(3, bufs);  // release all at once
```
For single-buffer cases, allocateBuffer() / freeBuffer() still work as before.
### Performance

- Minimize DRAM transfers – keep data in SRAM buffers as long as possible; batch operations before storing results back.
- Choose the right `VpeIdStrategy` – use `ByThreadId` for parallelism within an MU, `ByClusterId` for parallelism across clusters.