Hello Sort (Heterogeneous Programming)

Write a host program and a device kernel in one file, then build and run it with a single compiler invocation.

Experimental: pxcc is an experimental compiler. See the pxcc Compiler overview for the current status and caveats.

Overview

This tutorial mirrors Hello Sort — parallel sorting — but uses the pxcc single-source model instead of a separately compiled kernel. You will:

  • Mark a function as a device kernel with __pxl_kernel__
  • Launch it from host code with pxl::Launcher
  • Build host + device with pxcc++ and run the result

If you have not yet read PXL Key Concepts, do that first — this tutorial assumes you know what a task and device memory are.

1. Write the source

Create sort.cpp. The kernel and the host main() live in the same file.

#include <algorithm>
#include <cstdio>
#include "mu/mu.hpp"
#include "pxl/pxl.hpp"

// A device kernel is just a function annotated with __pxl_kernel__.
// Each parallel task sorts its own sub-array; mu::getTaskIdx() gives
// the index of the running task.
__pxl_kernel__ void sort_with_ptr(int* arr, int size)
{
    int idx = mu::getTaskIdx();
    int* base = arr + idx * size;
    std::sort(base, base + size);
}

int main()
{
    const int testCount = 2048;  // number of parallel tasks
    const int sortSize = 64;     // elements per task

    // Allocate device (CXL) memory and fill each array in descending order.
    auto* data = pxl::allocateMemory<int>(0, testCount * sortSize);
    for (int i = 0; i < testCount; i++)
        for (int j = 0; j < sortSize; j++)
            data[i * sortSize + j] = sortSize - j;

    // Launch one kernel instance per task and wait for completion.
    auto result = pxl::Launcher().execute<sort_with_ptr>(testCount, data, sortSize).run();
    if (result.status != pxl::Result::Success)
    {
        printf("Launch failed: %s\n", result.errorMessage.c_str());
        return 1;
    }
    printf("test done : %.2f us\n", result.elapsedUs);

    // Verify every array is now sorted in ascending order.
    for (int i = 0; i < testCount; i++)
        for (int j = 1; j < sortSize; j++)
            if (data[i * sortSize + j - 1] > data[i * sortSize + j])
            {
                printf("Verification failed at [%d][%d]\n", i, j);
                pxl::releaseMemory(data);
                return 1;
            }

    pxl::releaseMemory(data);
    return 0;
}

Note: Unlike the separate-kernel flow, there is no MU_KERNEL_ADD, no .mubin file, and no kernel path string. The host refers to the kernel by its symbol: execute<sort_with_ptr>(...).
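The indexing inside the kernel can be modeled on the host, which makes it easier to see which slice each task owns. The following is a minimal sketch, not pxcc code: sort_chunk and run_all_tasks are hypothetical names, and a plain loop variable stands in for the task index that mu::getTaskIdx() supplies on the device.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical host-side stand-in for the kernel body: task `idx`
// sorts only its own `size`-element slice of the flat array.
void sort_chunk(int* arr, int size, int idx)
{
    int* base = arr + idx * size;
    std::sort(base, base + size);
}

// Hypothetical sequential "launcher": runs tasks 0..taskCount-1 in order.
// The slices do not overlap, so the order in which tasks run does not
// change the result.
void run_all_tasks(std::vector<int>& data, int taskCount, int size)
{
    for (int idx = 0; idx < taskCount; idx++)
        sort_chunk(data.data(), size, idx);
}
```

Calling run_all_tasks(v, 2, 3) on a six-element vector sorts elements 0..2 and 3..5 independently; nothing is ever moved across a slice boundary.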

2. Compile with pxcc

A single command compiles both the host and the device code and links them into one executable:

pxcc++ sort.cpp -o sort

The mu_lib include path is added automatically on compile, and -lpxl is added automatically on link — no extra flags required.

Prefer CMake? It is just as short:

cmake_minimum_required(VERSION 3.11)

# Set the compiler before project(): CMake detects the compiler when project() runs.
set(CMAKE_CXX_COMPILER pxcc++)
project(sort)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(sort sort.cpp)

3. Run

sudo ./sort   # CXL device memory access requires elevated privileges

Expected output:

test done : XXXX.XX us

The elapsed time varies by hardware.

If any array is not sorted correctly, the program prints Verification failed at [i][j] (with the offending coordinates) and exits non-zero.
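The verification step can also be written with the standard library. This is a sketch under assumptions: first_unsorted is a hypothetical helper, not part of pxl, and it operates on any flat int buffer laid out as in the tutorial (taskCount sub-arrays of sortSize elements each).

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper: returns {i, j} for the first position where
// sub-array i has element j-1 greater than element j, or {-1, -1}
// if every sub-array is in ascending order.
std::pair<int, int> first_unsorted(const int* data, int taskCount, int sortSize)
{
    for (int i = 0; i < taskCount; i++)
    {
        const int* base = data + i * sortSize;
        // is_sorted_until returns the first element that breaks the order,
        // or the end of the range if none does.
        const int* bad = std::is_sorted_until(base, base + sortSize);
        if (bad != base + sortSize)
            return {i, static_cast<int>(bad - base)};
    }
    return {-1, -1};
}
```

The returned pair corresponds to the [i][j] coordinates that the tutorial's nested loop prints. Ordering across sub-array boundaries is deliberately not checked, because each task sorts only its own slice.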

How this differs from Hello Sort

  Separate-kernel Hello Sort                  pxcc Hello Sort

    host executable                               pxcc executable
    ┌──────────────────────┐                    ┌──────────────────────────┐
    │ host code            │                    │ host code                │
    │                      │                    │                          │
    │ createModule(...)    │                    │ __pxl_kernel__           │
    │ createFunction(...)  │                    │ sort_with_ptr            │
    │ buildMap(...)        │                    │                          │
    │ execute(...)         │                    │ Launcher.execute<...>()  │
    │ synchronize()        │                    └──────────────────────────┘
    └──────────┬───────────┘
               │
               │ loads external kernel file
               ▼
    ┌──────────────────────┐
    │ mu_kernel.mubin      │
    │ sort_with_ptr        │
    └──────────────────────┘

In the separate-kernel flow the host loads the kernel at runtime with createModule("…mubin") and resolves it with createFunction(...) / buildMap(...). With pxcc the kernel travels inside the executable and is launched by its function symbol with pxl::Launcher().execute<sort_with_ptr>(...).

Template kernels

A __pxl_kernel__ may be a function template, so one kernel can serve many element types — int, a fixed-size string (sort by name), or your own structs. This is new to the single-source model: the separate-kernel build flow (a kernel compiled to a standalone .mubin and loaded by name) could not express template kernels, because a name string resolves to exactly one compiled symbol.

template <typename T>
__pxl_kernel__ void sort_with_ptr(T* arr, int size)
{
    int idx = mu::getTaskIdx();
    std::sort(arr + idx * size, arr + idx * size + size);
}

// Launch — name the kernel with a concrete type.
auto r = pxl::Launcher().execute<sort_with_ptr<int>>(testCount, data, sortSize).run();

Explicit type only. A template kernel must be named with a concrete type argument — sort_with_ptr<int>, sort_with_ptr<FixedString>, etc. pxcc does not deduce the type from the call arguments, so there is no implicit instantiation; launch (or createModule()->createFunction<…>()) once per type you need.
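Standard C++ imposes a similar constraint whenever a function is passed as a template argument: the name of a function template alone does not identify one function, while a specialization such as sort_slice<int> does. The sketch below is an analogy only and says nothing about how pxl::Launcher is implemented; sort_slice and launch_on_host are hypothetical names, and the task index is passed explicitly because mu::getTaskIdx() exists only on the device. It requires C++17.

```cpp
#include <algorithm>

// Hypothetical host-side kernel template: task `idx` sorts its own slice.
template <typename T>
void sort_slice(int idx, T* arr, int size)
{
    std::sort(arr + idx * size, arr + idx * size + size);
}

// Hypothetical launcher that takes the kernel as a template argument
// and calls it once per task index.
template <auto Kernel, typename... Args>
void launch_on_host(int taskCount, Args... args)
{
    for (int idx = 0; idx < taskCount; idx++)
        Kernel(idx, args...);
}

// launch_on_host<sort_slice<int>>(n, data, size);  // OK: names one function
// launch_on_host<sort_slice>(n, data, size);       // error: names a whole template
```

Each concrete type used this way produces its own instantiation, which matches the "once per type you need" rule above.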

Element types must be trivially copyable: they live in device memory and are moved and swapped by the device sort. Declare a user-defined type as struct __mu_shared__ so that the type and its operators are available on the device. To sort by name, use a fixed-size character buffer rather than std::string, whose layout and heap pointers differ between the host and device standard libraries:

struct __mu_shared__ FixedString       // sort by name
{
    char data[16];
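    bool operator<(const FixedString& rhs) const
    {
        // Lexicographic compare over the whole buffer, byte by byte as unsigned values.
        for (unsigned i = 0; i < sizeof(data); i++)
            if (data[i] != rhs.data[i])
                return static_cast<unsigned char>(data[i]) < static_cast<unsigned char>(rhs.data[i]);
        return false;  // equal buffers are not less-than
    }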
};

struct __mu_shared__ Point             // sort 2-D points by distance from origin
{
    int x, y;
    int dist2() const { return x * x + y * y; }
    bool operator<(const Point& rhs) const { return dist2() < rhs.dist2(); }
};
// ... then: pxl::Launcher().execute<sort_with_ptr<FixedString>>(...).run();
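The trivially-copyable requirement can be checked at compile time with the standard type traits. This is a host-only sketch: the Point below repeats the tutorial's struct without the __mu_shared__ annotation so that it builds with any C++17 compiler, and is_plain_element is a hypothetical name, not a pxl facility.

```cpp
#include <algorithm>
#include <string>
#include <type_traits>

// Host-only copy of the tutorial's Point (annotation omitted on purpose).
struct Point
{
    int x, y;
    int dist2() const { return x * x + y * y; }
    bool operator<(const Point& rhs) const { return dist2() < rhs.dist2(); }
};

// Hypothetical guard: true for types that can be copied byte for byte.
template <typename T>
constexpr bool is_plain_element = std::is_trivially_copyable_v<T>;

static_assert(is_plain_element<int>);
static_assert(is_plain_element<Point>);
// std::string manages heap storage, so it fails the check.
static_assert(!is_plain_element<std::string>);
```

Placing a static_assert like this next to each element type turns an accidental non-trivial member (for example a std::string field) into a build error instead of a device-side surprise.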

A runnable version is in example/experimental/template_programming.cpp. It selects the element type at run time with --template-type <int|string|custom_type> and sorts one type per run, verifying the device result against a host std::sort.

Build it the same way as the other bundled examples, then run one type per invocation (use sudo — the device sort touches CXL device memory):

cd example/experimental
./build.sh
sudo ./template_programming --template-type int
sudo ./template_programming --template-type string
sudo ./template_programming --template-type custom_type   # custom struct (Point)

One type is sorted per process because each run issues a single kernel launch.

Next steps

  • Compiler — the annotation model (__pxl_kernel__, __mu_device__, __mu_shared__), options, device flags, and linking.
  • Troubleshooting — common build errors.