qpu-xla | Raspberry Pi 5 QPU ML kernels

Runtime Stack

qpu-xla treats the Raspberry Pi 5 QPU as a practical execution target for edge ML workloads. It uses upstream py-videocore7 as a dependency, so this repository can focus on reusable kernels, operator experiments, and paper reproduction scripts.

Reported Metrics

21.67 GOPS packed INT16-in / INT32-acc GEMM

94.38x best GEMM speedup over NumPy

18.60 GOPS INT32 Conv2D execute-only

14.41 GOPS INT32 attention execute-only

Integer GEMM throughput (GOPS) and speedup over NumPy
Size	INT32 NumPy	INT32 QPU	INT32 speedup	INT16-in QPU	INT16-in speedup
256	1.18	6.30	5.34x	6.30	5.40x
512	0.57	16.69	29.49x	16.82	29.70x
768	0.57	20.49	36.25x	20.49	33.91x
1024	0.21	10.73	50.15x	21.67	94.38x
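
For readers who want to relate the table's columns to raw timings, the following is a minimal sketch of how GEMM throughput and speedup are commonly derived. It assumes the usual convention of counting one multiply and one add per inner-product term (2·M·N·K operations); the repository's own scripts may count differently. The helper names `gemm_ops`, `gops`, `time_best`, and `speedup` are hypothetical and are not part of qpu-xla or py-videocore7.

```python
import time


def gemm_ops(m, n, k):
    """Operation count for an (m x k) @ (k x n) GEMM.

    Assumption: each inner-product term counts as one multiply plus one add.
    """
    return 2 * m * n * k


def gops(ops, seconds):
    """Convert an operation count and a wall-clock time to giga-ops per second."""
    return ops / seconds / 1e9


def time_best(fn, repeats=5):
    """Run fn() several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_gops, accelerated_gops):
    """Ratio of accelerated throughput to baseline throughput."""
    return accelerated_gops / baseline_gops
```

With NumPy installed, a CPU baseline could be timed as `time_best(lambda: a @ b)` for two INT32 arrays `a` and `b`, then converted with `gops(gemm_ops(n, n, n), seconds)`. Dividing the rounded GOPS columns in the table does not reproduce the speedup columns exactly, presumably because the reported speedups were computed from unrounded timings.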

Run the Scripts

Use Raspberry Pi 5 hardware with `/dev/dri/card0` access. PyTorch is optional and only needed for selected CPU baseline comparisons.

uv sync
uv run examples/igemm.py
uv run examples/igemm_int16.py
uv run examples/minmax.py
uv run examples/pool2d.py
uv run examples/tiledconv2d.py
uv run examples/tiledattention.py
uv run examples/tiledlenet5.py
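
Before running the scripts above, a quick check of the DRM device node can separate a missing device from a permissions problem. The sketch below uses only the Python standard library. The function name `drm_device_status` is hypothetical, and the assumption that both read and write access are required is mine rather than something the source states.

```python
import os


def drm_device_status(path="/dev/dri/card0"):
    """Classify access to a DRM device node as 'missing', 'no-permission', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    # Assumption: the QPU driver needs the node opened read-write.
    if not os.access(path, os.R_OK | os.W_OK):
        return "no-permission"
    return "ok"


if __name__ == "__main__":
    print(drm_device_status())
```

A result of "no-permission" usually means the current user lacks access to the device node; on many Linux distributions that access is granted through group membership, though the exact group depends on the system.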

Paper Timeline

Submission: TBD
Acceptance: TBD
Camera-ready: TBD
Presentation: MLSys 2026 YPS, Bellevue, WA

Roadmap

Broader QPU operator coverage for edge inference.
CPU/QPU scheduling for mixed workloads.
A stable persistent runtime API above the current scripts.
Lightweight end-to-end LLM inference experiments.
Reproducible benchmark harnesses for cached and execute-only timing.
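
As an illustration of the last roadmap item, here is a minimal sketch of a harness that reports both an end-to-end time and an execute-only time. It assumes that "execute-only" means timing the kernel launch alone, with buffer setup and kernel preparation done once outside the timed region; the source does not define the term, nor "cached" timing, so treat this as one plausible reading. The `benchmark` function and its `setup`/`execute` callables are hypothetical and not part of the current scripts.

```python
import time


def benchmark(setup, execute, repeats=10):
    """Return (end_to_end_seconds, execute_only_seconds), each the best of `repeats`.

    setup()        -> prepares and returns whatever state the kernel needs
    execute(state) -> runs the kernel once on that state
    """
    # End-to-end: setup cost is paid inside the timed region on every run.
    end_to_end = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        state = setup()
        execute(state)
        end_to_end = min(end_to_end, time.perf_counter() - start)

    # Execute-only: setup runs once, outside the timed region.
    state = setup()
    execute(state)  # untimed warm-up run
    execute_only = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        execute(state)
        execute_only = min(execute_only, time.perf_counter() - start)

    return end_to_end, execute_only
```

Reporting the minimum over several repeats reduces noise from other processes; a harness aimed at reproducibility might also record the median and the repeat count alongside each figure.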