# Examples

Code examples demonstrating XArith library usage in device kernels.
## Quick Start

The examples below omit buffer-allocation error checking for brevity. See Buffer Allocation below for production-safe patterns.
### L2 Distance

```cpp
#include <xarith/xarith.hpp>

void computeExampleL2Distance(const float* vec1, const float* vec2,
                              float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();
    uint64_t diffBuf = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);

    ctx.sub(buf2, buf1, diffBuf);         // diffBuf = vec2 - vec1
    *result = ctx.dot(diffBuf, diffBuf);  // L2^2 = (vec2 - vec1) · (vec2 - vec1)
}
MU_KERNEL_ADD(computeExampleL2Distance);
```
### Inner Product

```cpp
void computeExampleInnerProduct(const float* vec1, const float* vec2,
                                float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension, mu::vdma::VpeIdStrategy::ByClusterId);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);

    *result = ctx.dot(buf1, buf2);
}
MU_KERNEL_ADD(computeExampleInnerProduct);
```
### Vector Addition

```cpp
void computeExampleVectorAdd(const float* vec1, const float* vec2,
                             float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();
    uint64_t bufResult = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);

    ctx.add(buf1, buf2, bufResult);  // bufResult = buf1 + buf2
    ctx.store(bufResult, result);    // write back to DRAM
}
MU_KERNEL_ADD(computeExampleVectorAdd);
```
## CMake Integration

```cmake
cmake_minimum_required(VERSION 3.11)

# Set MU toolchain paths from environment variables (set by the SDK installer)
set(MU_LIB_PATH "$ENV{MU_LIB_PATH}")
set(MU_LLVM_PATH "$ENV{MU_LLVM_PATH}")
set(CMAKE_C_COMPILER ${MU_LLVM_PATH}/bin/clang)
set(CMAKE_CXX_COMPILER ${MU_LLVM_PATH}/bin/clang)

project(my_device_kernel LANGUAGES CXX)

# Find the xarith package (installed at /opt/xarith/)
find_package(xarith REQUIRED)

# Create the device kernel executable
add_executable(my_kernel.mubin my_kernel.cpp)

# Link the xarith library
target_link_libraries(my_kernel.mubin PRIVATE xarith)

# Include directories
target_include_directories(my_kernel.mubin PRIVATE
    ${MU_LLVM_PATH}/picolibc-rv/include
    ${MU_LLVM_PATH}/libcxx-rv/include/c++/v1
    ${MU_LIB_PATH}/include
)

# Set MU-specific compile flags
target_compile_options(my_kernel.mubin PRIVATE
    -target riscv64-unknown-elf
    -march=rv64gcv_zvl256b_zve64x_zve64f_zve64d
    -mabi=lp64d
    -O2
    -fno-exceptions
    -fno-rtti
)
```
## Example Project

The xarith-examples package includes complete host + device examples:

- basic – L2 distance, inner product, and vector add/sub with a test runner
- vpe – VPE operation testing with performance measurement
- knn – VPE-accelerated KNN with L2 and dot-product distance
Extract and build:

```shell
tar -xzf xarith-examples-<version>.tar.gz
cd xarith-examples-<version>
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=/opt/xarith ..
make
```
### basic

Runs four vector operation tests (L2 distance, inner product, vector add, vector sub) on the MU device.

```shell
./basic/compute_example_host
```
Expected output:

```
XArith Basic Example
L2 Distance Test
[Host] Executing kernel...
[Host] Expected: 128 Actual: 128
[Host] Test PASSED
Inner Product Test
[Host] Executing kernel...
[Host] Expected: 256 Actual: 256
[Host] Test PASSED
Vector Add Test
[Host] Executing kernel...
[Host] Test PASSED
Vector Sub Test
[Host] Executing kernel...
[Host] Test PASSED
4/4 tests passed
```
### vpe

Tests VPE operations (dot, L2, add, sub) across different configurations and measures throughput in GFLOPS.

```shell
./vpe/compute_vpe
```
Environment Variables:

| Variable | Default | Description |
|---|---|---|
| DEVICEID | 0 | Device ID |
Sample output (abbreviated):

```
XArith VPE Example
...
Dot Product - Task Ratio x Dimension Matrix
Op    | numSub | taskCount | Dim    | Iter/Task  | Total Iters | Time(us)   | GFLOPS     | Status
------+--------+-----------+--------+------------+-------------+------------+------------+--------
dot   |     24 |        48 |    128 |    2000000 |    96000000 | ...        | ...        | PASS
dot   |     24 |        96 |    128 |    2000000 |   192000000 | ...        | ...        | PASS
dot   |     24 |       192 |    128 |    2000000 |   384000000 | ...        | ...        | PASS
dot   |     24 |       384 |    128 |    2000000 |   768000000 | ...        | ...        | PASS
...
dot   |     24 |        48 |   1024 |    2000000 |    96000000 | ...        | ...        | PASS
dot   |     24 |        96 |   1024 |    2000000 |   192000000 | ...        | ...        | PASS
dot   |     24 |       192 |   1024 |    2000000 |   384000000 | ...        | ...        | PASS
dot   |     24 |       384 |   1024 |    2000000 |   768000000 | ...        | ...        | PASS
16/16 passed
...
All benchmarks completed.
```
The example runs multiple test suites including numSub sweeps, taskCount sweeps, dimension sweeps (64–8192), iteration sweeps, task ratio sweeps for all four operations, and a cross-product dimension × task ratio matrix.
### knn

VPE-accelerated KNN search with L2 and dot-product distance.

```shell
./knn/knn_vpe_host          # run with defaults
./knn/knn_vpe_host --help   # see all options
```
Options:

| Option | Default | Description |
|---|---|---|
| --dim N | 128 | Vector dimension |
| --vectors N | 960000 | Number of data points |
| --k N | 5 | Number of nearest neighbors |
Environment Variables:

| Variable | Default | Description |
|---|---|---|
| DEVICEID | 0 | Device ID |
| NUMSUB | 24 | Number of subsystems |
| TASKCOUNT | 48 | Number of subtasks for data partitioning |
Sample output:

```
XArith KNN Example
Config: dim=128 points=960000 k=5 tasks=48 subs=24 (20000 points/task)
KNN VPE L2 Distance Test
[VPE] Warmup (1 run): ...
[VPE] Elapsed (5 runs): ...
[VPE] min=... median=... max=...
[Result] Top-5:
  [0] label=... dist=...
  [1] label=... dist=...
  ...
[Verify] PASSED
KNN VPE Dot Product Test
[VPE] Warmup (1 run): ...
[VPE] Elapsed (5 runs): ...
[VPE] min=... median=... max=...
[Result] Top-5:
  [0] label=... dist=...
  [1] label=... dist=...
  ...
[Verify] PASSED
2/2 tests passed
```
## Tips

### Buffer Allocation

Always check allocation results before using buffers:

```cpp
uint64_t buf = ctx.allocateBuffer();
if (buf == mu::vdma::VpeContext::INVALID_BUFFER) {
    return;  // allocation failed
}
```
VpeContext supports up to 32 buffer slots, where each slot holds one FP32 vector of the configured dimension. Query availability with getAvailableBufferCount() before bulk allocation.
When multiple buffers are needed, use tryAllocateBuffers() for atomic all-or-nothing allocation. This prevents deadlock when multiple tasks compete for limited SRAM slots:
```cpp
uint64_t bufs[3];
while (!ctx.tryAllocateBuffers(3, bufs)) {
    // retry until all 3 buffers are available
}
// use bufs[0], bufs[1], bufs[2] ...

ctx.freeBuffers(3, bufs);  // release all at once
```
For single-buffer cases, allocateBuffer() / freeBuffer() still work as before.
### Performance

- Minimize DRAM transfers – keep data in SRAM buffers as long as possible; batch operations before storing results back.
- Choose the right `VpeIdStrategy` – use `ByThreadId` for parallelism within an MU, `ByClusterId` for parallelism across clusters.