pxcc — Heterogeneous C/C++ Compiler

pxcc is a single-source compiler for the XCENA platform. You write host code and device kernels in the same .cpp file; pxcc extracts the kernels, compiles them for the device, and embeds them into one host binary that loads them at runtime through PXL.

Experimental: available from SDK v1.4.8. pxcc is released as an experimental compiler. Command-line flags, the programming model, and the supported C++ surface may change in future SDK releases without a deprecation period. For production builds that need a stable interface, continue to use the separate-kernel build flow shown in Hello Sort.


Why pxcc

The traditional SDK flow compiles the compute kernel separately into a .mubin file, which the host application then loads by path at runtime. pxcc removes that split — one source, one compiler invocation, one binary:

| | Separate-kernel flow | pxcc single-source flow |
|---|---|---|
| Kernel source | mu_kernel/*.cpp, registered with MU_KERNEL_ADD | same .cpp as host, marked __pxl_kernel__ |
| Kernel binary | a standalone .mubin file shipped and loaded by path | embedded in the host executable; nothing extra to package or locate |
| Host references a kernel by | name string (createFunction("name")) | the function symbol (createFunction<name>()), a normal C++ entity |
| What the symbol buys you | | code navigation (go-to-def / rename), compile-time type checking of launch args, template kernels (sort_with_ptr<int>), and function overloading |
| Includes & linking | configured by hand | mu_lib include path and -lpxl added automatically |
| Build steps | build kernel → build host → run | pxcc++ app.cpp -o app → run |
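
The "Host references a kernel by" row is easier to appreciate with a small analogy. The sketch below is not PXL code: it is plain C++17 that runs entirely on the host, and every name in it (fillOnes, registry, launchByName, launchBySymbol) is hypothetical. It only illustrates why referring to a function as a symbol lets the compiler check the name and the argument types, while a name string defers both checks to run time.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a kernel; in this sketch it simply runs on the host.
void fillOnes(int* arr, int size)
{
    for (int i = 0; i < size; i++)
        arr[i] = 1;
}

using KernelFn = void (*)(int*, int);

// String-based lookup table: the key is just text, so the compiler cannot check it.
inline const std::map<std::string, KernelFn>& registry()
{
    static const std::map<std::string, KernelFn> table = {{"fillOnes", &fillOnes}};
    return table;
}

// A typo in `name` is only detected here, at run time.
inline void launchByName(const std::string& name, int* arr, int size)
{
    auto it = registry().find(name);
    if (it == registry().end())
        throw std::runtime_error("unknown kernel: " + name);
    it->second(arr, size);
}

// Symbol-based launch: the kernel is a template argument, so both its name and
// the argument types are checked when the call is compiled.
template <auto Kernel, typename... Args>
void launchBySymbol(Args&&... args)
{
    static_assert(std::is_invocable_v<decltype(Kernel), Args...>,
                  "launch arguments do not match the kernel signature");
    Kernel(std::forward<Args>(args)...);
}
```

With these definitions, launchBySymbol<fillOnes>(buf, 4) compiles and runs, launchBySymbol<fillOnes>(buf) is rejected at compile time by the static_assert, and launchByName("fillOne", buf, 4) compiles but throws at run time.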

A complete example

#include <algorithm>
#include <cstdio>
#include "mu/mu.hpp"
#include "pxl/pxl.hpp"

// Device kernel — extracted and compiled for the device by pxcc.
__pxl_kernel__ void sort_with_ptr(int* arr, int size)
{
    int idx = mu::getTaskIdx();
    std::sort(arr + idx * size, arr + idx * size + size);
}

int main()
{
    const int testCount = 2048, sortSize = 64;
    auto* data = pxl::allocateMemory<int>(0, testCount * sortSize);
    for (int i = 0; i < testCount; i++)
        for (int j = 0; j < sortSize; j++)
            data[i * sortSize + j] = sortSize - j;

    auto result = pxl::Launcher().execute<sort_with_ptr>(testCount, data, sortSize).run();
    if (result.status != pxl::Result::Success)
    {
        printf("FAIL: %s\n", result.errorMessage.c_str());
        pxl::releaseMemory(data);
        return 1;
    }

    // Verify every sub-array is sorted in ascending order.
    bool sorted = true;
    for (int i = 0; i < testCount && sorted; i++)
        for (int j = 1; j < sortSize; j++)
            if (data[i * sortSize + j - 1] > data[i * sortSize + j]) { sorted = false; break; }
    printf("%s\n", sorted ? "PASS" : "FAIL");

    pxl::releaseMemory(data);
    return sorted ? 0 : 1;
}

Build and run:

pxcc++ sort.cpp -o sort   # compile host + device and link in one step
sudo ./sort               # CXL device memory access requires elevated privileges

A runnable version of this example lives in example/experimental/heterogeneous_compile.cpp. A template-kernel variant lives in example/experimental/template_programming.cpp.
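
When a kernel like sort_with_ptr misbehaves, a host-only reference that uses the same slice arithmetic can help separate logic errors from device or launch problems. The following is a minimal sketch in plain C++ that needs no device: sortSlicesOnHost and slicesSorted are hypothetical helper names, and the loop variable idx stands in for the value mu::getTaskIdx() returns on the device.

```cpp
#include <algorithm>

// Host-side reference for the kernel above. The loop variable `idx` plays the
// role of mu::getTaskIdx(): task idx owns the slice [idx * size, idx * size + size).
// (Hypothetical helper, not part of the SDK.)
inline void sortSlicesOnHost(int* arr, int taskCount, int size)
{
    for (int idx = 0; idx < taskCount; idx++)
        std::sort(arr + idx * size, arr + idx * size + size);
}

// Returns true when every `size`-element slice is in ascending order.
// Slices are checked independently; order across slices does not matter.
inline bool slicesSorted(const int* arr, int taskCount, int size)
{
    for (int idx = 0; idx < taskCount; idx++)
        if (!std::is_sorted(arr + idx * size, arr + idx * size + size))
            return false;
    return true;
}
```

One possible use is to run sortSlicesOnHost on a copy of the input and compare it element by element with the buffer the device produced.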


Key facts

  • Two drivers: pxcc (C) and pxcc++ (C++). They are drop-in compiler front-ends — set them as CMAKE_C_COMPILER / CMAKE_CXX_COMPILER.
  • Host code may use C++17/20/23; device code is C++17 or earlier. Set the device standard with -Xmu=-std=... if needed (see Device options below); the full rule is in Constraints.
  • Includes and linking are automatic. The mu_lib include path is added on compile and -lpxl is added on link — no extra CMake configuration needed.
  • Host-only code just works. A source file with no __pxl_kernel__ is compiled like an ordinary clang C++ translation unit.
  • Changing kernel code requires recompiling the binary. Because kernels are embedded in the host executable, there is no separate kernel artifact to rebuild and relink on its own — rebuild the program after a kernel change.

Programming model

pxcc lets host code (x86-64) and device code (MU) live in the same translation unit. You mark which functions belong to the device with annotations; pxcc splits the source, compiles each half with the right backend, and embeds the device binary into the host binary.

Annotations

| Annotation | Where the code runs |
|---|---|
| (none) | Host only. An un-annotated function is compiled for the host and is not available on the device; calling it from kernel/device code is an error. |
| __pxl_kernel__ | Device kernel: a launchable entry point. |
| __mu_device__ | Device-only helper function. mu:: APIs are available. |
| __mu_shared__ | Code shared by host and device. mu:: APIs are not available. |

So an ordinary function stays on the host; to make code reachable from a kernel, annotate it __mu_device__ (device-only) or __mu_shared__ (host and device):

int hostUtil(int x);                    // no annotation → host only; not callable from a kernel
__mu_device__ int devUtil(int x);       // device only
__mu_shared__ int shared(int x);        // host and device

__pxl_kernel__ void k(int* a, int n)
{
    a[0] = devUtil(a[0]) + shared(a[0]); // OK
    // a[0] = hostUtil(a[0]);            // ERROR: host-only symbol on the device
}

A small end-to-end example using all three annotations together:

#include "mu/mu.hpp"

// Shared by both sides — plain logic, no mu:: calls.
__mu_shared__ int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Device-only helper — may call mu:: APIs.
__mu_device__ int taskBase(int size)
{
    return mu::getTaskIdx() * size;
}

// Kernel — the launchable entry point.
__pxl_kernel__ void scale(int* arr, int size, int factor)
{
    int base = taskBase(size);
    for (int i = 0; i < size; i++)
        arr[base + i] = clamp(arr[base + i] * factor, 0, 1000);
}

Notes:

  • For a class, __mu_shared__ is needed only on the class declaration (class __mu_shared__ ClassName { ... };); you do not annotate every member function individually.
  • __mu_shared__ functions cannot use mu:: APIs, because they are also compiled for the host. Use __mu_device__ when you need mu::.
  • The raw form [[clang::annotate("mu::device")]] is accepted as an alternative to the __mu_device__ macro.
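
The class-level form of __mu_shared__ can be sketched as follows. Two names here are assumptions made for illustration and do not come from the SDK: BUILD_WITH_PXCC is a macro you would define yourself on pxcc builds (for both sides, for example with -DBUILD_WITH_PXCC and -Xmu=-DBUILD_WITH_PXCC), and APP_SHARED is a wrapper macro that lets the same file also build with an ordinary compiler, which can be convenient for host-only unit tests. The sketch also assumes that mu/mu.hpp makes __mu_shared__ available.

```cpp
// BUILD_WITH_PXCC and APP_SHARED are hypothetical names used only in this sketch.
#if defined(BUILD_WITH_PXCC)
#include "mu/mu.hpp"
#define APP_SHARED __mu_shared__
#else
#define APP_SHARED  // plain host build: the annotation expands to nothing
#endif

// Annotated once, on the class; member functions carry no annotation of their own.
class APP_SHARED Range
{
public:
    constexpr Range(int lo, int hi) : lo_(lo), hi_(hi) {}

    // Plain logic only: shared code is compiled for the host too, so no mu:: calls.
    constexpr int clamp(int v) const { return v < lo_ ? lo_ : (v > hi_ ? hi_ : v); }

private:
    int lo_;
    int hi_;
};
```

A kernel could then call Range(0, 1000).clamp(x) in place of the free clamp function shown earlier, and host code could call the same member function directly.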

Multi-file projects

Kernels and shared code may span translation units. Declare cross-TU shared symbols with __mu_shared__ in a header and compile the files together:

pxcc++ -c kernels.cpp helpers.cpp   # compile each to <stem>.o in the current directory
pxcc++ kernels.o helpers.o -o app   # link
# or in one step:
pxcc++ kernels.cpp helpers.cpp -o app

Constraints

  • Device code is C++17 or earlier. Code that runs on the device (functions reachable from __pxl_kernel__) is limited to C++17; the device backend does not yet support later standards. Host code is unaffected and may use C++17, C++20, or C++23. The device standard can be set with -Xmu=-std=....
  • The mu:: namespace is available in __pxl_kernel__ and __mu_device__ code, but not in __mu_shared__ code.

Compilation and linking

pxcc accepts the familiar clang/gcc-style compile and link forms.

# Single file (CMake style)
pxcc++ -c input.cpp -o input.o      # compile only
pxcc++ input.o -o program           # link
pxcc++ input.cpp -o program         # compile + link in one step

# Multiple files
pxcc++ -c f1.cpp f2.cpp             # compile each to <stem>.o in the current directory
pxcc++ f1.cpp f2.cpp -o app         # compile + link several files

-o is always a single file, exactly as in clang/gcc. Passing -o together with -c and multiple source files is rejected (cannot specify -o when generating multiple output files); omit -o to get one <stem>.o per source in the current directory.
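
The output-naming rule can be written down as a small function. This is a toy model of the behaviour described above, not pxcc's implementation; objectOutputs and its parameters are hypothetical names, and it covers only the compile-only (-c) case.

```cpp
#include <filesystem>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model of the -c / -o rule; not pxcc's actual code.
// `sources` are the inputs given with -c, `dashO` is the value of -o, if any.
inline std::vector<std::string> objectOutputs(const std::vector<std::string>& sources,
                                              const std::optional<std::string>& dashO)
{
    // -o names exactly one file, so it cannot be combined with several sources.
    if (dashO && sources.size() > 1)
        throw std::invalid_argument("cannot specify -o when generating multiple output files");
    if (dashO)
        return {*dashO};

    // Without -o, each source produces <stem>.o in the current directory,
    // so any directory part of the source path is dropped.
    std::vector<std::string> outputs;
    for (const auto& src : sources)
        outputs.push_back(std::filesystem::path(src).stem().string() + ".o");
    return outputs;
}
```

Under this model, objectOutputs({"f1.cpp", "src/f2.cpp"}, std::nullopt) yields f1.o and f2.o, while passing both sources together with an -o value throws.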

Two things happen automatically, so you do not pass them yourself:

  • the mu_lib include path is added on every compile, and
  • -lpxl is added on every link.

This is why the CMake setup needs nothing beyond setting the compiler and the C++ standard.

Device options — -Xmu=

Flags that should reach the device compiler are forwarded with -Xmu=<arg> (one flag per -Xmu=). Anything not prefixed with -Xmu= applies to the host side.

| Form | Effect on device compilation |
|---|---|
| -Xmu=-std=c++17 | Set the device C++ standard (C++17 or earlier; host standard is set independently). |
| -Xmu=-O0 to -Xmu=-O3, -Xmu=-Os | Device optimization level (default -O3). |
| -Xmu=-g | Emit device debug info. |
| -Xmu=-I/path | Add a device include search path. |
| -Xmu=-isystem -Xmu=/path | Add a device-only system include path (not visible to the host compiler). |
| -Xmu=-DNAME=VALUE | Define a device preprocessor macro. |
| -Xmu=-L/path | Add a device library search path. |
| -Xmu=-lfoo | Link an extra device library, e.g. -Xmu=-L/opt/xarith/lib -Xmu=-lxarith. |

For example:

pxcc++ -Xmu=-g -Xmu=-O0 -c kernel.cpp -o kernel.o   # device debug build, no opt
pxcc++ -Xmu=-I/opt/inc -Xmu=-DDEBUG kernel.cpp -o app
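
The forwarding rule (one flag per -Xmu=, everything else goes to the host) can be modelled in a few lines. This is a simplified sketch, not pxcc's parser: SplitArgs and splitArgs are hypothetical names, and the model ignores special cases such as options documented as applying to both sides.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Result of the split: arguments for the host side and for the device side.
struct SplitArgs
{
    std::vector<std::string> host;
    std::vector<std::string> device;
};

// Toy model of -Xmu= forwarding; not pxcc's actual code.
inline SplitArgs splitArgs(const std::vector<std::string>& args)
{
    constexpr std::string_view prefix = "-Xmu=";
    SplitArgs out;
    for (const auto& arg : args)
    {
        if (arg.compare(0, prefix.size(), prefix) == 0)
            out.device.push_back(arg.substr(prefix.size()));  // strip the prefix, forward the rest
        else
            out.host.push_back(arg);                          // unprefixed: stays on the host side
    }
    return out;
}
```

In this model a two-token device option needs the prefix on each token, which matches the -Xmu=-isystem -Xmu=/path form in the table: the device side receives -isystem followed by /path.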

Host compiler options

Flags without the -Xmu= prefix apply to the host compiler. The common ones:

| Option | Purpose |
|---|---|
| -I<dir> | Add a host include directory. |
| -D<macro>[=value] | Define a host preprocessor macro. |
| -U<macro> | Undefine a host preprocessor macro. |
| --isystem <dir> | Add a system include path to both host and device (use -Xmu=-isystem for device only). |
| --include <file> | Force-include a header before the source. |
| -x <language> | Set the source language (c or c++). |
| --std=<standard> | Host C++ standard (default c++17). The host may use a newer standard than the device; see Constraints. |
| -MF <file> | Write make-style dependency output to <file> (for incremental build systems). |
| -MT <target> | Set the target name emitted in the dependency output. |
| -MQ <target> | Like -MT, but quotes characters special to make. |

Any flag pxcc does not recognize is forwarded to the host compiler unchanged.

Selecting the host compiler

pxcc resolves the host compiler in this order:

--host-compiler=<path>  >  PXCC_HOST_CXX / PXCC_HOST_CC  >  CXX / CC
                        >  clang in PATH  >  /usr/bin/clang

  • --host-compiler=<path>: use a specific host compiler for this invocation.
  • PXCC_HOST_CXX / PXCC_HOST_CC (env): host C++ / C compiler for pxcc++ / pxcc.
  • CXX / CC (env): fallback when the above are unset.

For example:

pxcc++ --host-compiler=/usr/bin/g++-12 -c app.cpp -o app.o
PXCC_HOST_CXX=/usr/bin/g++-12 pxcc++ app.cpp -o app
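
The precedence list can be read as a chain of fallbacks. The sketch below models it for the pxcc++ driver and is not pxcc's implementation: resolveHostCxx and EnvLookup are hypothetical names, the environment lookup is passed in so the logic can be exercised without touching the real environment, and the PATH search is reduced to an optional value supplied by the caller. How pxcc treats an empty environment variable is not stated above, so the model leaves that decision to the lookup function.

```cpp
#include <functional>
#include <optional>
#include <string>

// Looks up an environment variable; injected so the logic is easy to test.
using EnvLookup = std::function<std::optional<std::string>(const std::string&)>;

// Toy model of the host-compiler precedence for pxcc++; not pxcc's actual code.
inline std::string resolveHostCxx(const std::optional<std::string>& hostCompilerFlag,
                                  const EnvLookup& env,
                                  const std::optional<std::string>& clangInPath)
{
    if (hostCompilerFlag)
        return *hostCompilerFlag;       // 1. --host-compiler=<path>
    if (auto v = env("PXCC_HOST_CXX"))
        return *v;                      // 2. pxcc-specific environment variable
    if (auto v = env("CXX"))
        return *v;                      // 3. generic environment variable
    if (clangInPath)
        return *clangInPath;            // 4. clang found in PATH
    return "/usr/bin/clang";            // 5. last resort
}
```

A real caller could build the lookup on top of std::getenv, returning std::nullopt when it yields a null pointer.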

Diagnostics and intermediates

pxcc++ -save-temps -c input.cpp -o input.o   # keep intermediate files

-save-temps preserves the intermediate artifacts of the split-compile pipeline next to the output:

| File | Description |
|---|---|
| mu.{filename}.cpp | Extracted device code. |
| host.{filename}.cpp | Transformed host code. |
| mu_kernel.mubin | Device binary. |
| mu_kernel.o | Embedding object linked into the host binary. |

These are the first things to inspect when a kernel does not behave as expected — see Troubleshooting.

To see what pxcc actually runs, use one of the verbose forms:

pxcc++ --pxcc-verbose -c input.cpp -o input.o   # pxcc's own internal command log
pxcc++ -v -c input.cpp -o input.o               # commands the host compiler runs
pxcc++ --dry-run -c input.cpp -o input.o        # print commands without executing (= -###)

CMake integration

# Set the compiler before the first project() call so CMake picks it up.
set(CMAKE_CXX_COMPILER pxcc++)
set(CMAKE_CXX_STANDARD 17)

add_executable(myapp main.cpp kernel.cpp)
target_compile_options(myapp PRIVATE -Xmu=-O2)   # device options

Override the host compiler from CMake via the flag or an environment variable:

cmake -DCMAKE_CXX_COMPILER=pxcc++ -DCMAKE_CXX_FLAGS="--host-compiler=/usr/bin/g++-12" ..
# or
PXCC_HOST_CXX=/usr/bin/g++-12 cmake -DCMAKE_CXX_COMPILER=pxcc++ ..

mu_lib include paths and -lpxl linking are handled automatically.

Help and version

pxcc++ --help      # brief help with key features
pxcc++ --manual    # full man-page-style manual (all options, examples, troubleshooting)
pxcc++ --version   # version string (pxcc, mu_llvm, host/device targets)

Reference summary

| Option | Applies to | Purpose |
|---|---|---|
| -c | both | Compile only; do not link. |
| -o <path> | both | Output file (always a single file). |
| -j <N> | driver | Max parallel compilation jobs (default: CPU cores). |
| -Xmu=<arg> | device | Forward <arg> to the device compiler. |
| --host-compiler=<path> | driver | Choose the host compiler. |
| --std=<standard> | host | Host C++ standard (default c++17). |
| -MF / -MT / -MQ | host | Make-style dependency output: file / target / quoted target. |
| -save-temps | both | Keep intermediate host/device sources and binaries. |
| --pxcc-verbose | driver | Log the commands pxcc executes. |
| -v | host | Show the commands the host compiler runs. |
| --dry-run, -### | driver | Print commands without executing. |
| --help / --version / --manual | driver | Brief help / version / full manual. |