MLSys 2026 Young Professionals Symposium

QPU-first ML kernels for Raspberry Pi 5

A small ML runtime stack for VideoCore VII: tiled integer GEMM, GEMM-backed convolution, pooling, attention-style kernels, and persistent executors built on py-videocore7.

Paper preview (qpu-xla, MLSys 2026 YPS): Toward a Small ML Runtime Stack for Raspberry Pi 5 QPUs

Runtime Stack

qpu-xla treats the Raspberry Pi 5 QPU as a practical edge ML execution substrate. It keeps upstream py-videocore7 as a dependency, so the repository itself focuses on reusable kernels, operator experiments, and paper reproduction scripts.
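As a CPU-side illustration of two of the kernel ideas named above (tiled integer GEMM and GEMM-backed convolution), here is a minimal NumPy sketch. It is a reference model only, not the qpu-xla QPU implementation: the function names `tiled_igemm` and `conv2d_via_gemm`, the tile size of 16, and the stride-1 "valid" convolution layout are all assumptions made for this example.

```python
import numpy as np


def tiled_igemm(a, b, tile=16):
    """Tiled integer GEMM with INT32 accumulation (hypothetical reference model).

    a: (M, K) integer array, b: (K, N) integer array, e.g. int16 inputs.
    Each tile is widened to int32 before the multiply-accumulate.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.int32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Slicing clips at the array edge, so ragged tiles just work.
                a_blk = a[i:i + tile, p:p + tile].astype(np.int32)
                b_blk = b[p:p + tile, j:j + tile].astype(np.int32)
                c[i:i + tile, j:j + tile] += a_blk @ b_blk
    return c


def conv2d_via_gemm(x, w):
    """Stride-1, no-padding convolution lowered to one GEMM via im2col.

    x: (C, H, W) input, w: (F, C, KH, KW) filters -> (F, OH, OW) int32 output.
    """
    c, h, wd = x.shape
    f, c2, kh, kw = w.shape
    assert c == c2, "channel counts must match"
    oh, ow = h - kh + 1, wd - kw + 1
    # im2col: row r holds, for every output pixel, the input value that
    # filter tap r multiplies. Row order matches w.reshape(f, -1).
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for dy in range(kh):
            for dx in range(kw):
                cols[row] = x[ci, dy:dy + oh, dx:dx + ow].reshape(-1)
                row += 1
    out = tiled_igemm(w.reshape(f, -1), cols)
    return out.reshape(f, oh, ow)
```

A model like this is mainly useful as a correctness oracle: the output of a QPU kernel can be compared element by element against it, since integer arithmetic leaves no rounding tolerance to tune.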

Reported Metrics

21.67 GOPS packed INT16-in / INT32-acc GEMM
94.38x best GEMM speedup over NumPy
18.60 GOPS INT32 Conv2D execute-only
14.41 GOPS INT32 attention execute-only
Integer GEMM throughput (GOPS)

Size   INT32 NumPy   INT32 QPU   INT32 speedup   INT16-in QPU   INT16-in speedup
256    1.18          6.30        5.34x           6.30           5.40x
512    0.57          16.69       29.49x          16.82          29.70x
768    0.57          20.49       36.25x          20.49          33.91x
1024   0.21          10.73       50.15x          21.67          94.38x
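To make the units in the table concrete, the sketch below shows one common way to turn a GEMM timing into GOPS and a speedup. It rests on two assumptions the page does not state: that "Size" is the edge length n of square n x n matrices, and that throughput counts 2·n³ operations (one multiply and one add per multiply-accumulate). The helper names are hypothetical, and the NumPy baseline here is not the project's benchmark script.

```python
import time

import numpy as np


def gemm_gops(n, seconds):
    """GOPS for an n x n x n GEMM, assuming 2*n^3 operations in total."""
    return 2.0 * n**3 / seconds / 1e9


def time_best(fn, repeats=5):
    """Best-of-N wall-clock time of fn(), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best


def numpy_int32_baseline(n, repeats=3):
    """Measure a NumPy INT32 GEMM on the CPU and report it in GOPS."""
    rng = np.random.default_rng(0)
    a = rng.integers(-128, 128, size=(n, n), dtype=np.int32)
    b = rng.integers(-128, 128, size=(n, n), dtype=np.int32)
    return gemm_gops(n, time_best(lambda: a @ b, repeats))


def speedup(qpu_gops, cpu_gops):
    """Speedup of the QPU over the CPU baseline at the same problem size."""
    return qpu_gops / cpu_gops


if __name__ == "__main__":
    cpu = numpy_int32_baseline(256)
    print(f"NumPy INT32 GEMM, n=256: {cpu:.2f} GOPS")
```

Dividing the rounded GOPS columns will not necessarily reproduce the published speedup columns exactly. The speedups were presumably computed from unrounded timings, possibly with a separate baseline run per configuration.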

Run the Scripts

Run the scripts on Raspberry Pi 5 hardware, as a user with access to `/dev/dri/card0`. PyTorch is optional and only needed for selected CPU baseline comparisons.

uv sync
uv run examples/igemm.py
uv run examples/igemm_int16.py
uv run examples/minmax.py
uv run examples/pool2d.py
uv run examples/tiledconv2d.py
uv run examples/tiledattention.py
uv run examples/tiledlenet5.py
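Before running the commands above, a small preflight check can confirm that the device node is reachable. This is a sketch under assumptions: the helper name `check_drm_device` is hypothetical, and treating "access" as read plus write permission is a guess rather than a documented py-videocore7 requirement.

```python
import os
import sys


def check_drm_device(path="/dev/dri/card0"):
    """Return (ok, message) describing whether the DRM device node looks usable."""
    if not os.path.exists(path):
        return False, f"{path} not found"
    # Assumption: the runtime needs both read and write access to the node.
    if not os.access(path, os.R_OK | os.W_OK):
        return False, f"{path} exists but is not readable and writable by this user"
    return True, f"{path} is accessible"


if __name__ == "__main__":
    ok, msg = check_drm_device()
    print(msg)
    sys.exit(0 if ok else 1)
```

If the check fails on permissions, the usual remedy on Linux is group membership for the device node (often a group such as `video`), but the exact group depends on the OS image.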

Paper Timeline

  1. Submission: TBD
  2. Acceptance: TBD
  3. Camera-ready: TBD
  4. Presentation: MLSys 2026 YPS, Bellevue, WA

Roadmap

  • Broader QPU operator coverage for edge inference.
  • CPU/QPU scheduling for mixed workloads.
  • A stable persistent runtime API above the current scripts.
  • Lightweight end-to-end LLM inference experiments.
  • Reproducible benchmark harnesses for cached and execute-only timing.