Hello Sort (Heterogeneous Programming)
Write a host program and a device kernel in one file, then build and run it with a single compiler invocation.
Experimental: pxcc is an experimental compiler. See the pxcc Compiler overview for the current status and caveats.
Overview
This tutorial mirrors Hello Sort — parallel sorting — but uses the pxcc single-source model instead of a separately compiled kernel. You will:
- Mark a function as a device kernel with __pxl_kernel__
- Launch it from host code with pxl::Launcher
- Build host + device with pxcc++ and run the result
If you have not yet read PXL Key Concepts, do that first — this tutorial assumes you know what a task and device memory are.
1. Write the source
Create sort.cpp. The kernel and the host main() live in the same file.
#include <algorithm>
#include <cstdio>
#include "mu/mu.hpp"
#include "pxl/pxl.hpp"
// A device kernel is just a function annotated with __pxl_kernel__.
// Each parallel task sorts its own sub-array; mu::getTaskIdx() gives
// the index of the running task.
__pxl_kernel__ void sort_with_ptr(int* arr, int size)
{
    int idx = mu::getTaskIdx();
    int* base = arr + idx * size;
    std::sort(base, base + size);
}
int main()
{
    const int testCount = 2048;  // number of parallel tasks
    const int sortSize = 64;     // elements per task

    // Allocate device (CXL) memory and fill each array in descending order.
    auto* data = pxl::allocateMemory<int>(0, testCount * sortSize);
    for (int i = 0; i < testCount; i++)
        for (int j = 0; j < sortSize; j++)
            data[i * sortSize + j] = sortSize - j;

    // Launch one kernel instance per task and wait for completion.
    auto result = pxl::Launcher().execute<sort_with_ptr>(testCount, data, sortSize).run();
    if (result.status != pxl::Result::Success)
    {
        printf("Launch failed: %s\n", result.errorMessage.c_str());
        pxl::releaseMemory(data);  // release on the failure path too
        return 1;
    }
    printf("test done : %.2f us\n", result.elapsedUs);

    // Verify every array is now sorted in ascending order.
    for (int i = 0; i < testCount; i++)
        for (int j = 1; j < sortSize; j++)
            if (data[i * sortSize + j - 1] > data[i * sortSize + j])
            {
                printf("Verification failed at [%d][%d]\n", i, j);
                pxl::releaseMemory(data);
                return 1;
            }

    pxl::releaseMemory(data);
    return 0;
}
Note: Unlike the separate-kernel flow, there is no MU_KERNEL_ADD, no .mubin file, and no kernel path string. The host refers to the kernel by its symbol: execute<sort_with_ptr>(...).
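The indexing inside sort_with_ptr can be checked without a device. The following is a minimal host-only sketch, not pxcc code: sort_chunk and simulate_tasks are hypothetical names, the task index is passed as a plain argument instead of being read from mu::getTaskIdx(), and the tasks run one after another in a loop rather than in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only stand-in for the kernel body: sorts the sub-array owned by
// task `idx`. On the device, idx would come from mu::getTaskIdx().
void sort_chunk(int* arr, int size, int idx)
{
    int* base = arr + idx * size;  // start of this task's sub-array
    std::sort(base, base + size);  // touches only [base, base + size)
}

// Hypothetical helper: builds the tutorial's input (every sub-array in
// descending order) and runs each task sequentially on the host.
std::vector<int> simulate_tasks(int taskCount, int size)
{
    assert(taskCount > 0 && size > 0);
    std::vector<int> data(static_cast<std::size_t>(taskCount) * size);
    for (int i = 0; i < taskCount; i++)
        for (int j = 0; j < size; j++)
            data[i * size + j] = size - j;

    for (int idx = 0; idx < taskCount; idx++)
        sort_chunk(data.data(), size, idx);
    return data;
}
```

Because each task only touches the range [idx * size, idx * size + size), no two tasks overlap, which is what makes one task per sub-array safe to run in parallel.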
2. Compile with pxcc
A single command compiles both the host and the device code and links them into one executable:
pxcc++ sort.cpp -o sort
The mu_lib include path is added automatically on compile, and -lpxl is added automatically on link — no extra flags required.
Prefer CMake? It is just as short:
cmake_minimum_required(VERSION 3.11)
project(sort)
set(CMAKE_CXX_COMPILER pxcc++)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
add_executable(sort sort.cpp)
3. Run
sudo ./sort # CXL device memory access requires elevated privileges
Expected output:
test done : XXXX.XX us
The elapsed time varies by hardware.
If any array is not sorted correctly, the program prints Verification failed at [i][j] (with the offending coordinates) and exits non-zero.
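The verification step can also be written with the standard library. This is a sketch of one alternative, assuming the same flat layout as the tutorial (taskCount sub-arrays of size elements each); first_unsorted is a hypothetical helper name, not part of pxl.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: returns {-1, -1} when every sub-array is ascending,
// otherwise the {task, element} coordinates of the first element that is
// smaller than its predecessor (the same [i][j] the tutorial prints).
std::pair<int, int> first_unsorted(const int* data, int taskCount, int size)
{
    assert(data != nullptr && taskCount >= 0 && size >= 0);
    for (int i = 0; i < taskCount; i++)
    {
        const int* base = data + i * size;
        const int* bad = std::is_sorted_until(base, base + size);
        if (bad != base + size)
            return {i, static_cast<int>(bad - base)};
    }
    return {-1, -1};
}
```

Returning the coordinates instead of printing inside the loop keeps the memory release and the exit code in one place in the caller.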
How this differs from Hello Sort
Separate-kernel Hello Sort pxcc Hello Sort
host executable pxcc executable
┌──────────────────────┐ ┌──────────────────────────┐
│ host code │ │ host code │
│ │ │ │
│ createModule(...) │ │ __pxl_kernel__ │
│ createFunction(...) │ │ sort_with_ptr │
│ buildMap(...) │ │ │
│ execute(...) │ │ Launcher.execute<...>() │
│ synchronize() │ └──────────────────────────┘
└──────────┬───────────┘
│
│ loads external kernel file
▼
┌──────────────────────┐
│ mu_kernel.mubin │
│ sort_with_ptr │
└──────────────────────┘
In the separate-kernel flow the host loads the kernel at runtime with createModule("…mubin") and resolves it with createFunction(...) / buildMap(...). With pxcc the kernel travels inside the executable and is launched by its function symbol with Launcher.execute<sort_with_ptr>().
Template kernels
A __pxl_kernel__ may be a function template, so one kernel can serve many element types — int, a fixed-size string (sort by name), or your own structs. This is new to the single-source model: the separate-kernel build flow (a kernel compiled to a standalone .mubin and loaded by name) could not express template kernels, because a name string resolves to exactly one compiled symbol.
template <typename T>
__pxl_kernel__ void sort_with_ptr(T* arr, int size)
{
    int idx = mu::getTaskIdx();
    std::sort(arr + idx * size, arr + idx * size + size);
}
// Launch — name the kernel with a concrete type.
auto r = pxl::Launcher().execute<sort_with_ptr<int>>(testCount, data, size).run();
Explicit type only. A template kernel must be named with a concrete type argument, such as sort_with_ptr<int> or sort_with_ptr<FixedString>. pxcc does not deduce the type from the call arguments, so there is no implicit instantiation; launch (or createModule()->createFunction<…>()) once per type you need.
Element types must be trivially copyable, because they live in device memory and are moved and swapped by the device sort. Declare a user-defined type as struct __mu_shared__ so that the type and its operators are available on the device. To sort by name, use a fixed-size character buffer rather than std::string (whose layout and heap pointers differ between the host and device standard libraries):
struct __mu_shared__ FixedString // sort by name
{
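    char data[16];
    bool operator<(const FixedString& rhs) const
    {
        // One possible lexicographic compare: byte-wise over the whole buffer.
        for (int i = 0; i < 16; i++)
        {
            const unsigned char a = static_cast<unsigned char>(data[i]);
            const unsigned char b = static_cast<unsigned char>(rhs.data[i]);
            if (a != b)
                return a < b;
        }
        return false;  // equal buffers
    }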
};
struct __mu_shared__ Point // sort 2-D points by distance from origin
{
    int x, y;
    int dist2() const { return x * x + y * y; }
    bool operator<(const Point& rhs) const { return dist2() < rhs.dist2(); }
};
// ... then: pxl::Launcher().execute<sort_with_ptr<FixedString>>(...).run();
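The element-type requirements can be checked on the host before involving the device. The sketch below is an assumption-laden stand-in, not pxcc code: the __mu_shared__ annotation is left out so a stock C++17 compiler accepts it, the FixedString comparison is one possible byte-wise implementation, and sort_chunk is a hypothetical host version of the template kernel body.

```cpp
#include <algorithm>
#include <cassert>
#include <type_traits>
#include <vector>

// Host-only copies of the element types (no __mu_shared__ here).
struct FixedString
{
    char data[16];
    bool operator<(const FixedString& rhs) const
    {
        // Assumed comparison: byte-wise over the whole zero-padded buffer.
        for (int i = 0; i < 16; i++)
            if (data[i] != rhs.data[i])
                return static_cast<unsigned char>(data[i]) <
                       static_cast<unsigned char>(rhs.data[i]);
        return false;
    }
};

struct Point
{
    int x, y;
    int dist2() const { return x * x + y * y; }
    bool operator<(const Point& rhs) const { return dist2() < rhs.dist2(); }
};

// Hypothetical host stand-in for the template kernel body. The
// static_assert rejects element types that are not trivially copyable.
template <typename T>
void sort_chunk(T* arr, int size, int idx)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "element type must be trivially copyable");
    assert(arr != nullptr && size >= 0 && idx >= 0);
    std::sort(arr + idx * size, arr + idx * size + size);
}
```

A type such as std::string fails the static_assert at compile time, which surfaces the problem before any device launch is attempted.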
A runnable version is in example/experimental/template_programming.cpp. It selects the element type at run time with --template-type <int|string|custom_type> and sorts one type per run, verifying the device result against a host std::sort.
Build it the same way as the other bundled examples, then run one type per invocation (use sudo — the device sort touches CXL device memory):
cd example/experimental
./build.sh
sudo ./template_programming --template-type int
sudo ./template_programming --template-type string
sudo ./template_programming --template-type custom_type # custom struct (Point)
One type is sorted per process because each run issues a single kernel launch.
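Since a template kernel must be named with a concrete type, a run-time flag has to be mapped to one explicit instantiation per branch. The sketch below shows one way that dispatch could look on the host. It is not the code of the bundled example: dispatch and sort_and_check are hypothetical names, the inputs are tiny hard-coded samples, and std::stable_sort stands in for the device launch so the block builds with a stock C++17 compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct FixedString
{
    char data[16];
    bool operator<(const FixedString& rhs) const
    {
        // Assumed comparison: byte-wise over the whole zero-padded buffer.
        for (int i = 0; i < 16; i++)
            if (data[i] != rhs.data[i])
                return static_cast<unsigned char>(data[i]) <
                       static_cast<unsigned char>(rhs.data[i]);
        return false;
    }
};

struct Point
{
    int x, y;
    int dist2() const { return x * x + y * y; }
    bool operator<(const Point& rhs) const { return dist2() < rhs.dist2(); }
};

// Hypothetical stand-in for "launch sort_with_ptr<T>, then verify".
// Returns true when the result matches a host std::sort reference.
template <typename T>
bool sort_and_check(std::vector<T> input)
{
    std::vector<T> reference = input;
    std::sort(reference.begin(), reference.end());  // host reference
    std::stable_sort(input.begin(), input.end());   // placeholder for the device sort
    return std::equal(input.begin(), input.end(), reference.begin(),
                      [](const T& a, const T& b) { return !(a < b) && !(b < a); });
}

// Maps the flag value to one explicit instantiation.
// Returns 0 on success, 1 on a mismatch, 2 for an unknown type name.
int dispatch(const std::string& type)
{
    if (type == "int")
        return sort_and_check<int>({3, 1, 2}) ? 0 : 1;
    if (type == "string")
        return sort_and_check<FixedString>({FixedString{"bob"}, FixedString{"alice"}}) ? 0 : 1;
    if (type == "custom_type")
        return sort_and_check<Point>({Point{3, 4}, Point{1, 0}}) ? 0 : 1;
    return 2;
}
```

Each branch names its instantiation explicitly, mirroring the rule that the kernel type is never deduced from the arguments.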
Next steps
- Compiler: the annotation model (__pxl_kernel__, __mu_device__, __mu_shared__), options, device flags, and linking.
- Troubleshooting: common build errors.