Kernel Execution

A Map represents a single kernel launch. It binds a Function to a taskCount and to a set of arguments, then runs the kernel across the MU cores owned by its parent Job.

auto map = job->buildMap("my_kernel", taskCount);
map->execute(arg0, arg1, arg2);
map->synchronize();

This page covers how a Map distributes work, how arguments are passed, and the lifecycle of a single execution.


taskCount and batchSize

Two numbers control how a launch is sized:

  • taskCount — the total number of times your kernel will be invoked. Set when the Map is built (buildMap(func, taskCount)). The maximum is pxl::MaxTaskCount ((1 << 20) - 1).
  • batchSize — the number of tasks each MU core processes back-to-back. Default: 16. Set with Map::setBatchSize(N).

The library divides taskCount into batches and assigns each batch to an MU core. When there are more batches than cores, a core processes more than one batch:

batchCount   = ceil(taskCount / batchSize)
active cores ≤ min(batchCount, available MU cores in the Job)

The active core count is an upper bound — locality mode (see below) and other dispatch constraints may leave some cores idle even when batches are available.

The unit of distribution is the MU core, not the Sub. numSub only sets the size of the available core pool. Inside the pool, work is handed out per-core.

For the example below, assume the device exposes 128 MU cores per Sub, so a Job with 4 Subs has a pool of 512 cores:

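| taskCount | batchSize | batchCount | Active cores   | Kernel invocations |
|-----------|-----------|------------|----------------|--------------------|
| 1024      | 16        | 64         | 64             | 1024               |
| 1024      | 1         | 1024       | 512 (pool cap) | 1024               |
| 100       | 1         | 100        | 100            | 100                |
| 8192      | 8         | 1024       | 512 (pool cap) | 8192               |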

The kernel is invoked exactly taskCount times regardless of batchSize. A larger batchSize reduces per-core entry overhead at the cost of fewer cores being active.
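
The sizing arithmetic above can be checked with a few lines of host code. This is a minimal sketch, not part of PXL: `LaunchShape` and `launchShape` are hypothetical names, and the result is only the upper bound on active cores, since locality mode and other dispatch constraints may lower the real figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not a PXL declaration.
struct LaunchShape {
    std::uint32_t batchCount;      // ceil(taskCount / batchSize)
    std::uint32_t maxActiveCores;  // upper bound on cores that receive work
};

// Computes the sizing numbers for one launch.
// poolCores is the number of MU cores available to the Job.
inline LaunchShape launchShape(std::uint32_t taskCount,
                               std::uint32_t batchSize,
                               std::uint32_t poolCores) {
    assert(batchSize != 0);  // a batch must hold at least one task
    // Integer ceiling division, avoiding floating point.
    const std::uint32_t batchCount = (taskCount + batchSize - 1) / batchSize;
    return LaunchShape{batchCount, std::min(batchCount, poolCores)};
}
```

With a pool of 512 cores, `launchShape(1024, 16, 512)` gives 64 batches and at most 64 active cores, which matches the first row of the table.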

Reading the task index inside a kernel

Each kernel invocation gets a unique logical index in the range [0, taskCount-1]:

#include "mu/mu.hpp"

void my_kernel(int* data, int size) {
    auto idx = mu::getTaskIdx();        // [0, taskCount-1]
    auto total = mu::getTaskCount();    // == taskCount passed to buildMap
    // ...
}

For full kernel-side details (header, registration, parameter limits), see the Kernel Programming Guide.


Argument Types

Map::execute(args...) accepts three kinds of arguments:

| Kind           | When to use                                         | How PXL treats it |
|----------------|-----------------------------------------------------|-------------------|
| Constant       | Trivial scalar types (int, float, struct of PODs).  | Broadcast: every kernel invocation sees the same value. |
| Device pointer | Pointer returned by Context::memAlloc().            | Broadcast: every invocation sees the same pointer. The kernel typically uses mu::getTaskIdx() to compute its slice. |
| NDArray        | Typed, shaped view over a device buffer.            | Sliced: PXL hands each invocation its own NDArray view. |

A struct of trivial POD members counts as a single Constant parameter — pass it as one argument and the whole struct is broadcast to every invocation. Total kernel parameters (counted this way) cannot exceed 9.
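
To make the counting rule concrete, the sketch below checks an argument list against the limit at compile time. Everything in it is hypothetical and written for illustration only: `kMaxKernelParams`, `fitsParamLimit`, and `SortConfig` are not PXL names, and PXL may enforce the limit differently.

```cpp
#include <cstddef>

// Limit quoted on this page: at most 9 kernel parameters per launch.
constexpr std::size_t kMaxKernelParams = 9;

// Hypothetical helper, not part of PXL:
// true when an argument list fits within the parameter limit.
template <typename... Args>
constexpr bool fitsParamLimit() {
    return sizeof...(Args) <= kMaxKernelParams;
}

// A struct of POD members counts as one parameter,
// however many fields it carries.
struct SortConfig {
    int rowSize;
    int ascending;
    float threshold;
};

// Pointer + struct = two parameters, well within the limit.
static_assert(fitsParamLimit<int*, SortConfig>(),
              "pointer plus one struct is two parameters");

// Ten separate scalars would exceed the limit.
static_assert(!fitsParamLimit<int, int, int, int, int,
                              int, int, int, int, int>(),
              "ten scalars exceed the limit");
```

Grouping related scalars into one struct, as `SortConfig` does, is one way to stay under the limit when a kernel needs many small settings.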

Launch with NDArray

NDArray<T> carries shape and stride information. PXL slices the array along its leading dimension and gives each invocation its own slice:

auto data = ctx->memAlloc(testCount * rowSize * sizeof(int));
auto arr  = pxl::NDArray<int>(static_cast<int*>(data), {testCount, rowSize});

auto map = job->buildMap("sort_with_ndarray", testCount);
map->execute(arr);
map->synchronize();

Launch with a device pointer

When you pass a raw device pointer or scalar, every invocation receives the same value. The kernel uses mu::getTaskIdx() to address its slice manually:

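// data and rowSize are the buffer and row length from the NDArray example above.
auto map = job->buildMap("sort_with_ptr", testCount);
map->execute(static_cast<int*>(data), rowSize);
map->synchronize();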
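
The manual addressing a pointer-based kernel performs can be sketched on the host. The example assumes a row-major buffer of taskCount rows, matching the {testCount, rowSize} shape used earlier. `rowForTask` and `fillRowsWithTaskIdx` are hypothetical names, and the `taskIdx` parameter stands in for the value mu::getTaskIdx() would return inside a real kernel.

```cpp
#include <cassert>
#include <cstddef>

// Returns the start of the row owned by one invocation in a row-major
// buffer of taskCount rows, each rowSize elements long.
inline int* rowForTask(int* base, std::size_t rowSize,
                       std::size_t taskIdx, std::size_t taskCount) {
    assert(taskIdx < taskCount);  // indices run over [0, taskCount-1]
    return base + taskIdx * rowSize;
}

// Host-side stand-in for a launch: run the kernel body once per task index.
// Here the "kernel" writes its own index into every element of its row.
inline void fillRowsWithTaskIdx(int* base, std::size_t rowSize,
                                std::size_t taskCount) {
    for (std::size_t idx = 0; idx < taskCount; ++idx) {
        int* row = rowForTask(base, rowSize, idx, taskCount);
        for (std::size_t i = 0; i < rowSize; ++i) {
            row[i] = static_cast<int>(idx);
        }
    }
}
```

Because every invocation receives the same base pointer, the index arithmetic is the only thing that keeps invocations from writing to each other's rows.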

Locality Mode

Map::setLocalityMode(mode) controls how tasks are distributed across the Subs and Clusters that the Job owns.

  • CompactMode (default) — fill within a Sub first (cluster-first inside a Sub). Favors L2 reuse when neighbouring tasks share data.
  • SpreadMode — distribute across Subs first. Favors aggregate memory bandwidth when each task is independent and bandwidth-bound.

The tables below assume 4 Clusters per Sub for illustration:

numSub = 1:

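| Cluster | CL0         | CL1          | CL2          | CL3            |
|---------|-------------|--------------|--------------|----------------|
| Spread  | 0, 4, 8, 12 | 1, 5, 9, 13  | 2, 6, 10, 14 | 3, 7, 11, 15   |
| Compact | 0, 1, 2, 3  | 4, 5, 6, 7   | 8, 9, 10, 11 | 12, 13, 14, 15 |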

numSub = 2:

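|         | SUB0 / CL0 | SUB0 / CL1 | SUB0 / CL2 | SUB0 / CL3 | SUB1 / CL0 | SUB1 / CL1 | SUB1 / CL2 | SUB1 / CL3 |
|---------|------------|------------|------------|------------|------------|------------|------------|------------|
| Spread  | 0, 8       | 2, 10      | 4, 12      | 6, 14      | 1, 9       | 3, 11      | 5, 13      | 7, 15      |
| Compact | 0, 1       | 2, 3       | 4, 5       | 6, 7       | 8, 9       | 10, 11     | 12, 13     | 14, 15     |

map->setLocalityMode(pxl::LocalityMode::SpreadMode);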

The default CompactMode fits most workloads. Switch to SpreadMode if profiling shows the kernel is bandwidth-bound and would benefit from spreading evenly across Subs.
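
The placement pattern in the two tables can be reproduced with a small model. This sketch is inferred from the illustrative tables only and is not the PXL dispatcher: `Mode`, `Placement`, and `place` are hypothetical names, and the model assumes the number of indices is a multiple of the total Cluster count, as in the tables.

```cpp
#include <cassert>

// Hypothetical types for illustration; not PXL declarations.
enum class Mode { Compact, Spread };

struct Placement {
    int sub;      // Sub index
    int cluster;  // Cluster index inside that Sub
};

// Maps an index from the tables to a (Sub, Cluster) pair.
inline Placement place(Mode mode, int index, int total,
                       int numSub, int clustersPerSub) {
    const int clusterCount = numSub * clustersPerSub;
    assert(index >= 0 && index < total);
    assert(total % clusterCount == 0);  // model covers even splits only

    if (mode == Mode::Compact) {
        // Consecutive indices stay together: fill one Cluster, then the
        // next Cluster in the same Sub, then move on to the next Sub.
        const int perCluster = total / clusterCount;
        const int flat = index / perCluster;  // Cluster number across all Subs
        return Placement{flat / clustersPerSub, flat % clustersPerSub};
    }
    // Spread: round-robin, advancing the Sub fastest and the Cluster next.
    const int slot = index % clusterCount;
    return Placement{slot % numSub, slot / numSub};
}
```

For 16 indices on 2 Subs of 4 Clusters, `place(Mode::Spread, 1, 16, 2, 4)` lands on SUB1 / CL0 while `place(Mode::Compact, 1, 16, 2, 4)` stays on SUB0 / CL0, matching the second table.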


Execution Lifecycle

A single Map::execute() call moves the Map through a sequence of states. The current state is observable with Map::getExecuteStatus().

stateDiagram-v2
    direction LR
    [*] --> Idle
    Idle --> HostInit: execute()
    HostInit --> DeviceInit
    DeviceInit --> Request
    Request --> Waiting
    Waiting --> DeviceFinalize
    DeviceFinalize --> HostFinalize
    HostFinalize --> Idle
    Waiting --> Fail: device error
    Waiting --> Cancelled: cancel()
    Cancelled --> HostInit: execute()

Both Idle and Cancelled are reusable — calling execute() on a Map in either state starts a new run. Fail is terminal: the Map should not be re-executed after a device error.

| State          | What happens                                            |
|----------------|---------------------------------------------------------|
| HostInit       | Host-side argument setup and host-to-device data sync.  |
| DeviceInit     | Per-execute device-side initialization.                 |
| Request        | Tasks are dispatched to the device.                     |
| Waiting        | Host waits for device-side completion.                  |
| DeviceFinalize | Device-side cleanup.                                    |
| HostFinalize   | Device-to-host data sync, callback dispatch.            |
| Idle           | Run finished cleanly. The Map is reusable.              |
| Fail           | A device or runtime error occurred.                     |
| Cancelled      | A cancel() request was honored.                         |
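
The lifecycle can also be read as a transition function. The sketch below is a host-side model of the diagram, not PXL code: the `State` and `Event` enums and the `next`, `isReusable`, and `isSettled` helpers are hypothetical, and the status type actually returned by Map::getExecuteStatus() may be declared differently.

```cpp
#include <cassert>

// Hypothetical enum mirroring the state names on this page.
enum class State {
    Idle, HostInit, DeviceInit, Request, Waiting,
    DeviceFinalize, HostFinalize, Fail, Cancelled
};

// Advance models the unlabeled edges that the runtime takes on its own.
enum class Event { Execute, Advance, DeviceError, Cancel };

// Returns the next state per the diagram, or the same state when the
// diagram has no edge for that (state, event) pair.
inline State next(State s, Event e) {
    switch (s) {
        case State::Idle:
        case State::Cancelled:
            return e == Event::Execute ? State::HostInit : s;
        case State::HostInit:
            return e == Event::Advance ? State::DeviceInit : s;
        case State::DeviceInit:
            return e == Event::Advance ? State::Request : s;
        case State::Request:
            return e == Event::Advance ? State::Waiting : s;
        case State::Waiting:
            if (e == Event::Advance)     return State::DeviceFinalize;
            if (e == Event::DeviceError) return State::Fail;
            if (e == Event::Cancel)      return State::Cancelled;
            return s;
        case State::DeviceFinalize:
            return e == Event::Advance ? State::HostFinalize : s;
        case State::HostFinalize:
            return e == Event::Advance ? State::Idle : s;
        case State::Fail:
            return s;  // terminal: no outgoing edges
    }
    return s;
}

// True for the states in which execute() starts a new run.
inline bool isReusable(State s) {
    return s == State::Idle || s == State::Cancelled;
}

// True for the states a finished run can end in: Idle, Cancelled, or Fail.
inline bool isSettled(State s) {
    return isReusable(s) || s == State::Fail;
}
```

A host loop that polls the status could use a check like `isSettled` to decide when to stop, and one like `isReusable` to decide whether the same Map may be executed again.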

The Map::getProgress() helper returns target / issued / done packet counts for live progress reporting.


Synchronization, Cancellation, and Callbacks

Map::execute() is non-blocking — it enqueues work onto the Map’s stream and returns. It returns pxl::Result::Failure if the underlying stream is torn down or its consumer thread dies; check this when robustness against runtime tear-down matters:

if (map->execute(arg0, arg1) != pxl::Result::Success) {
    // stream torn down — abort or recover
}

Use one of the following to wait for or interrupt a run:

  • synchronize() — block until the Map reaches Idle, Cancelled, or Fail.
  • cancel() — soft stop. Skips not-yet-issued batches and waits for in-flight tasks to drain. Output is not synced back. The Map is reusable afterward.
  • Callbacks — register before calling execute():
map->setCompletionCallback([](void* arg) { /* success */ }, nullptr);
map->setMessageCallback   ([](void* msg, void* arg) { /* device message */ }, nullptr);
map->setErrorCallback     ([](void* arg) { /* failure */ }, nullptr);

To pipeline multiple kernel launches, issue several execute() calls in sequence and place a single synchronize() at the end.


→ Related: Programming Objects, Streams