Device Topology

An XCENA device exposes its compute resources as a three-level hierarchy. Knowing this layout makes the rest of PXL — Job, Map, taskCount, locality mode — easier to reason about.

Hierarchy

flowchart TD
    Device[XCENA Device]
    Device --> Sub0[Sub 0]
    Device --> SubN[Sub N]
    Sub0 --> Cluster0[Cluster 0]
    Sub0 --> ClusterM[Cluster M]
    SubN --> SubNDots[same structure]
    Cluster0 --> Core0[MU core 0]
    Cluster0 --> CoreK[MU core k]

Layer	What it is
Sub	The unit of compute reservation. A `Job` is created with a number of Subs and owns them for its lifetime.
Cluster	A group of MU cores inside a Sub that share an L2 cache.
MU core	A RISC-V execution unit inside a Cluster, and the smallest scheduling unit. Each MU core runs your kernel function, and this is the level at which kernel code calls `mu::getTaskIdx()`.

The exact counts (Subs per device, Clusters per Sub, MU cores per Cluster) vary by device generation. Don’t hard-code these numbers; only the hierarchy itself is stable.
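One way to follow this advice is to treat the three counts as runtime values and derive everything else from them. The sketch below is illustrative only: `Topology`, `CoreCoord`, and `locate` are hypothetical names, not PXL types or functions, and the page does not say how PXL itself reports these counts. It also assumes a simple flat numbering of cores (Sub-major, then Cluster, then core), which is a convention chosen for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a device layout. Not a PXL type.
// All three counts are supplied at runtime instead of being compiled in.
struct Topology {
    std::size_t subsPerDevice;
    std::size_t clustersPerSub;
    std::size_t coresPerCluster;

    std::size_t coresPerSub() const { return clustersPerSub * coresPerCluster; }
    std::size_t totalCores() const { return subsPerDevice * coresPerSub(); }
};

// Position of one MU core within the Sub / Cluster / core hierarchy.
struct CoreCoord {
    std::size_t sub;
    std::size_t cluster;
    std::size_t core;
};

// Decompose a flat core index in [0, totalCores()) into hierarchy coordinates.
// The flat numbering is an assumption made for this example.
inline CoreCoord locate(const Topology& t, std::size_t flatIndex) {
    assert(flatIndex < t.totalCores());
    CoreCoord c;
    c.sub = flatIndex / t.coresPerSub();
    const std::size_t withinSub = flatIndex % t.coresPerSub();
    c.cluster = withinSub / t.coresPerCluster;
    c.core = withinSub % t.coresPerCluster;
    return c;
}
```

Because code written this way only depends on the shape of the hierarchy, it keeps working when a different device generation changes any of the three counts.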

How parallelism maps onto the hierarchy

When you launch a Map:

1. The Job provides a pool of MU cores (across all Subs the Job owns).
2. PXL distributes the launch’s tasks across those cores.
3. Each MU core invokes your kernel function once for each task it receives.

The unit of distribution is the MU core, not the Sub. `numSub` only sets the size of the available core pool. See Kernel Execution for how `taskCount` and `batchSize` interact with this pool.
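To make the relationship between `numSub`, the core pool, and per-core work concrete, here is a small host-side model. It is a sketch under stated assumptions, not PXL code: `corePoolSize` and `tasksPerCore` are hypothetical helpers, the per-level counts are placeholders, and round-robin assignment is assumed purely for illustration, since this page does not specify the policy PXL uses. The effect of `batchSize` is deliberately left out.

```cpp
#include <cstddef>
#include <vector>

// Size of the MU core pool a Job would own, given hypothetical per-level counts.
// numSub scales the pool; it does not change how a single core runs a task.
inline std::size_t corePoolSize(std::size_t numSub,
                                std::size_t clustersPerSub,
                                std::size_t coresPerCluster) {
    return numSub * clustersPerSub * coresPerCluster;
}

// For each core in the pool, count how many tasks it would receive if tasks
// were handed out round-robin (an assumed policy, chosen for illustration).
// Each count is also the number of kernel invocations on that core.
inline std::vector<std::size_t> tasksPerCore(std::size_t taskCount,
                                             std::size_t poolSize) {
    std::vector<std::size_t> counts(poolSize, 0);
    if (poolSize == 0) {
        return counts;  // no cores, nothing to distribute
    }
    for (std::size_t task = 0; task < taskCount; ++task) {
        counts[task % poolSize] += 1;
    }
    return counts;
}
```

With placeholder counts of 2 Clusters per Sub and 4 MU cores per Cluster, `numSub = 2` gives a pool of 16 cores, so 40 tasks mean two or three kernel invocations per core. Raising `numSub` to 4 doubles the pool to 32 cores and drops that to one or two invocations per core, which is the sense in which `numSub` only sets the pool size.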

The next page introduces the host-side objects you use to reserve Subs, load kernels, and launch work onto this hierarchy.