Examples

Code examples demonstrating XArith library usage in device kernels.

Quick Start

The examples below omit buffer allocation error checking for brevity. See Buffer Allocation for production-safe patterns.

L2 Distance

#include <xarith/xarith.hpp>

void computeExampleL2Distance(const float* vec1, const float* vec2,
                              float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();
    uint64_t diffBuf = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);
    ctx.sub(buf2, buf1, diffBuf);

    *result = ctx.dot(diffBuf, diffBuf);  // L2^2 = (vec2 - vec1) · (vec2 - vec1)
}
MU_KERNEL_ADD(computeExampleL2Distance);
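To validate the kernel's output, a plain scalar reference on the host is often useful. The following sketch is standard C++ and not part of the XArith API; the function name is illustrative.

```cpp
#include <cstdint>

// Host-side scalar reference for the kernel above: squared L2 distance.
// Computes (vec2 - vec1) · (vec2 - vec1) with a simple loop.
float referenceL2Squared(const float* vec1, const float* vec2, uint64_t dimension)
{
    float sum = 0.0f;
    for (uint64_t i = 0; i < dimension; ++i) {
        float d = vec2[i] - vec1[i];
        sum += d * d;
    }
    return sum;
}
```

With two 128-dimensional vectors whose elements differ by 1.0 everywhere, this returns 128, matching the expected L2 output shown in the basic example below.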

Inner Product

void computeExampleInnerProduct(const float* vec1, const float* vec2,
                                float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension, mu::vdma::VpeIdStrategy::ByClusterId);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);

    *result = ctx.dot(buf1, buf2);
}
MU_KERNEL_ADD(computeExampleInnerProduct);
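As with L2 distance, a scalar host-side reference makes it easy to check the device result. This is an illustrative sketch in standard C++, not part of the XArith API.

```cpp
#include <cstdint>

// Host-side scalar reference for the inner-product kernel above.
float referenceDot(const float* vec1, const float* vec2, uint64_t dimension)
{
    float sum = 0.0f;
    for (uint64_t i = 0; i < dimension; ++i)
        sum += vec1[i] * vec2[i];
    return sum;
}
```

For 128-dimensional vectors of all 1.0 and all 2.0, this returns 256, matching the expected inner-product output in the basic example below.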

Vector Addition

void computeExampleVectorAdd(const float* vec1, const float* vec2,
                             float* result, uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();
    uint64_t bufResult = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);

    ctx.add(buf1, buf2, bufResult);  // bufResult = buf1 + buf2

    ctx.store(bufResult, result);  // Write back to DRAM
}
MU_KERNEL_ADD(computeExampleVectorAdd);
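When checking element-wise results such as the vector sum above against a host reference, comparing floats with a small tolerance is more robust than exact equality. A minimal host-side checker might look like this (illustrative only; the name and tolerance are assumptions, not part of XArith):

```cpp
#include <cmath>
#include <cstdint>

// Compare a device-produced vector against a host reference within a
// relative tolerance; returns true when every element matches.
bool verifyClose(const float* actual, const float* expected,
                 uint64_t dimension, float relTol = 1e-5f)
{
    for (uint64_t i = 0; i < dimension; ++i) {
        float diff = std::fabs(actual[i] - expected[i]);
        float bound = relTol * std::fmax(std::fabs(expected[i]), 1.0f);
        if (diff > bound)
            return false;
    }
    return true;
}
```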

CMake Integration

cmake_minimum_required(VERSION 3.11)

# Set MU toolchain paths from environment variables (set by SDK installer)
set(MU_LIB_PATH "$ENV{MU_LIB_PATH}")
set(MU_LLVM_PATH "$ENV{MU_LLVM_PATH}")
set(CMAKE_C_COMPILER ${MU_LLVM_PATH}/bin/clang)
set(CMAKE_CXX_COMPILER ${MU_LLVM_PATH}/bin/clang++)

project(my_device_kernel LANGUAGES CXX)

# Find xarith package (installed at /opt/xarith/)
find_package(xarith REQUIRED)

# Create device kernel executable
add_executable(my_kernel.mubin my_kernel.cpp)

# Link xarith library
target_link_libraries(my_kernel.mubin PRIVATE xarith)

# Include directories
target_include_directories(my_kernel.mubin PRIVATE
    ${MU_LLVM_PATH}/picolibc-rv/include
    ${MU_LLVM_PATH}/libcxx-rv/include/c++/v1
    ${MU_LIB_PATH}/include
)

# Set MU-specific compile flags
target_compile_options(my_kernel.mubin PRIVATE
    -target riscv64-unknown-elf
    -march=rv64gcv_zvl256b_zve64x_zve64f_zve64d
    -mabi=lp64d
    -O2
    -fno-exceptions
    -fno-rtti
)

Example Project

The xarith-examples package includes complete Host + Device examples:

  • basic – L2 distance, inner product, vector add/sub with test runner
  • vpe – VPE operation testing with performance measurement
  • knn – VPE-accelerated KNN with L2 and dot product distance

Extract and build:

tar -xzf xarith-examples-<version>.tar.gz
cd xarith-examples-<version>
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=/opt/xarith ..
make

basic

Runs four vector operation tests (L2 distance, inner product, vector add, vector sub) on the MU device.

./basic/compute_example_host

Expected output:

XArith Basic Example

L2 Distance Test
[Host] Executing kernel...
[Host] Expected: 128  Actual: 128
[Host] Test PASSED

Inner Product Test
[Host] Executing kernel...
[Host] Expected: 256  Actual: 256
[Host] Test PASSED

Vector Add Test
[Host] Executing kernel...
[Host] Test PASSED

Vector Sub Test
[Host] Executing kernel...
[Host] Test PASSED

4/4 tests passed

vpe

Tests VPE operations (dot, L2, add, sub) across a range of configurations and measures throughput in GFLOPS.

./vpe/compute_vpe

Environment Variables:

Variable    Default    Description
DEVICEID    0          Device ID

Sample output (abbreviated):

XArith VPE Example

...

Dot Product - Task Ratio x Dimension Matrix

 Op   | numSub | taskCount |   Dim  | Iter/Task  | Total Iters |   Time(us) |   GFLOPS   | Status
------+--------+-----------+--------+------------+-------------+------------+------------+--------
 dot  |     24 |        48 |    128 |    2000000 |    96000000 |        ... |        ... |    PASS
 dot  |     24 |        96 |    128 |    2000000 |   192000000 |        ... |        ... |    PASS
 dot  |     24 |       192 |    128 |    2000000 |   384000000 |        ... |        ... |    PASS
 dot  |     24 |       384 |    128 |    2000000 |   768000000 |        ... |        ... |    PASS
 ...
 dot  |     24 |        48 |   1024 |    2000000 |    96000000 |        ... |        ... |    PASS
 dot  |     24 |        96 |   1024 |    2000000 |   192000000 |        ... |        ... |    PASS
 dot  |     24 |       192 |   1024 |    2000000 |   384000000 |        ... |        ... |    PASS
 dot  |     24 |       384 |   1024 |    2000000 |   768000000 |        ... |        ... |    PASS

 16/16 passed

...

All benchmarks completed.

The example runs multiple test suites including numSub sweeps, taskCount sweeps, dimension sweeps (64–8192), iteration sweeps, task ratio sweeps for all four operations, and a cross-product dimension × task ratio matrix.

knn

VPE-accelerated KNN search with L2 and dot product distance.

./knn/knn_vpe_host          # run with defaults
./knn/knn_vpe_host --help   # see all options

Options:

Option          Default    Description
--dim N         128        Vector dimension
--vectors N     960000     Number of data points
--k N           5          Number of nearest neighbors

Environment Variables:

Variable     Default    Description
DEVICEID     0          Device ID
NUMSUB       24         Number of subsystems
TASKCOUNT    48         Number of subtasks for data partitioning

Sample output:

XArith KNN Example

Config: dim=128 points=960000 k=5 tasks=48 subs=24 (20000 points/task)

KNN VPE L2 Distance Test
[VPE]    Warmup (1 run): ...
[VPE]    Elapsed (5 runs): ...
[VPE]    min=... median=... max=...
[Result] Top-5:
  [0] label=... dist=...
  [1] label=... dist=...
  ...
[Verify] PASSED

KNN VPE Dot Product Test
[VPE]    Warmup (1 run): ...
[VPE]    Elapsed (5 runs): ...
[VPE]    min=... median=... max=...
[Result] Top-5:
  [0] label=... dist=...
  [1] label=... dist=...
  ...
[Verify] PASSED

2/2 tests passed

Tips

Buffer Allocation

Always check allocation results before using buffers:

uint64_t buf = ctx.allocateBuffer();
if (buf == mu::vdma::VpeContext::INVALID_BUFFER) {
    return;  // allocation failed
}

VpeContext supports up to 32 buffer slots, where each slot holds one FP32 vector of the configured dimension. Query availability with getAvailableBufferCount() before bulk allocation.

When multiple buffers are needed, use tryAllocateBuffers() for atomic all-or-nothing allocation. This prevents deadlock when multiple tasks compete for limited SRAM slots:

uint64_t bufs[3];
while (!ctx.tryAllocateBuffers(3, bufs)) {
    // retry until all 3 buffers are available
}

// use bufs[0], bufs[1], bufs[2] ...

ctx.freeBuffers(3, bufs);  // release all at once

For single-buffer cases, allocateBuffer() / freeBuffer() still work as before.

Performance

  • Minimize DRAM transfers – Keep data in SRAM buffers as long as possible; batch operations before storing results back
  • Choose VpeIdStrategy – Use ByThreadId for parallelism within an MU, ByClusterId for parallelism across clusters
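As a sketch of the first tip, several operations can be chained on SRAM-resident buffers with a single store at the end. The snippet below uses only the VpeContext calls shown in the examples above and should be read as a sketch rather than a verified kernel; it assumes sub(a, b, out) computes out = a - b as in the L2 example, and it omits buffer allocation error checking as described under Buffer Allocation.

```cpp
// Sketch: result = (vec1 + vec2) - vec3 with one load per input and a
// single store at the end; the intermediate sum never touches DRAM.
void computeBatchedExample(const float* vec1, const float* vec2,
                           const float* vec3, float* result,
                           uint64_t dimension)
{
    mu::vdma::VpeContext ctx(dimension);

    uint64_t buf1 = ctx.allocateBuffer();
    uint64_t buf2 = ctx.allocateBuffer();
    uint64_t buf3 = ctx.allocateBuffer();
    uint64_t sumBuf = ctx.allocateBuffer();
    uint64_t outBuf = ctx.allocateBuffer();

    ctx.load(vec1, buf1);
    ctx.load(vec2, buf2);
    ctx.load(vec3, buf3);

    ctx.add(buf1, buf2, sumBuf);   // sumBuf = vec1 + vec2 (stays in SRAM)
    ctx.sub(sumBuf, buf3, outBuf); // outBuf = sumBuf - vec3 (no DRAM round-trip)

    ctx.store(outBuf, result);     // single write back to DRAM
}
MU_KERNEL_ADD(computeBatchedExample);
```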