Runtime Stack
qpu-xla treats the Raspberry Pi 5 QPU as a practical edge ML execution substrate. Upstream py-videocore7 stays an external dependency, so this repository focuses on reusable kernels, operator experiments, and paper reproduction scripts.
Reported Metrics
| Size | INT32 NumPy | INT32 QPU | INT32 speedup | INT16-in QPU | INT16-in speedup |
|---|---|---|---|---|---|
| 256 | 1.18 | 6.30 | 5.34x | 6.30 | 5.40x |
| 512 | 0.57 | 16.69 | 29.49x | 16.82 | 29.70x |
| 768 | 0.57 | 20.49 | 36.25x | 20.49 | 33.91x |
| 1024 | 0.21 | 10.73 | 50.15x | 21.67 | 94.38x |
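
The table does not state how the speedup columns are derived. The sketch below assumes the INT32 speedup is the INT32 QPU value divided by the INT32 NumPy value, and checks whether each reported ratio is compatible with inputs rounded to two decimals. The helper names are illustrative and not part of the repository, and the INT16-in columns are left out because their baseline is not shown.

```python
# INT32 rows copied from the table: (size, NumPy, QPU, reported speedup).
ROWS = [
    (256, 1.18, 6.30, 5.34),
    (512, 0.57, 16.69, 29.49),
    (768, 0.57, 20.49, 36.25),
    (1024, 0.21, 10.73, 50.15),
]

HALF_STEP = 0.005  # table values are printed to two decimals


def speedup_bounds(baseline, accelerated, half_step=HALF_STEP):
    """Return the (low, high) ratio range allowed by two-decimal rounding."""
    low = (accelerated - half_step) / (baseline + half_step)
    high = (accelerated + half_step) / (baseline - half_step)
    return low, high


def is_consistent(baseline, accelerated, reported, half_step=HALF_STEP):
    """True if the reported ratio fits inside the rounding range."""
    low, high = speedup_bounds(baseline, accelerated, half_step)
    # The reported speedup is itself rounded, so widen the range slightly.
    return low - half_step <= reported <= high + half_step


for size, base, qpu, reported in ROWS:
    low, high = speedup_bounds(base, qpu)
    verdict = "consistent" if is_consistent(base, qpu, reported) else "check"
    print(f"{size:>5}: ratio {qpu / base:6.2f}, "
          f"range [{low:.2f}, {high:.2f}], reported {reported:.2f} ({verdict})")
```

Recomputing from the rounded table entries will not reproduce the reported ratios digit for digit, which is why the check uses a range rather than an exact match.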
Run the Scripts
Run the scripts on Raspberry Pi 5 hardware with access to `/dev/dri/card0`. PyTorch is optional; it is only needed for selected CPU baseline comparisons.
uv sync
uv run examples/igemm.py
uv run examples/igemm_int16.py
uv run examples/minmax.py
uv run examples/pool2d.py
uv run examples/tiledconv2d.py
uv run examples/tiledattention.py
uv run examples/tiledlenet5.py
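Since the scripts depend on `/dev/dri/card0`, a quick preflight check can save a confusing failure later. The following is a minimal sketch using only the Python standard library. The function name `check_device` is hypothetical, and the requirement for both read and write access is an assumption, since the instructions above only say "access".

```python
import os

DEFAULT_DEVICE = "/dev/dri/card0"  # path named in the instructions above


def check_device(path=DEFAULT_DEVICE):
    """Return 'ok', 'missing', or 'no-permission' for a device node."""
    if not os.path.exists(path):
        return "missing"
    # Assumption: the scripts need both read and write access to the node.
    if not os.access(path, os.R_OK | os.W_OK):
        return "no-permission"
    return "ok"


status = check_device()
print(f"{DEFAULT_DEVICE}: {status}")
```

If the result is `no-permission`, `ls -l /dev/dri/` shows which group owns the node; on many Linux systems that access is granted through group membership such as `video`.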
Paper Timeline
- Submission: TBD
- Acceptance: TBD
- Camera-ready: TBD
- Presentation: MLSys 2026 YPS, Bellevue, WA
Roadmap
- Broader QPU operator coverage for edge inference.
- CPU/QPU scheduling for mixed workloads.
- A stable persistent runtime API above the current scripts.
- Lightweight end-to-end LLM inference experiments.
- Reproducible benchmark harnesses for cached and execute-only timing.
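The last roadmap item does not define "cached" or "execute-only" timing. One possible reading is that one-time setup cost is measured separately from repeated execution. The sketch below illustrates that split with placeholder callables; `time_phases`, `build_table`, and `sum_table` are hypothetical names, and the workload is plain Python rather than a QPU kernel.

```python
import statistics
import time


def time_phases(setup, execute, repeats=5):
    """Time one-time setup separately from repeated execution.

    setup:   callable returning prepared state (a stand-in for building
             or uploading a kernel).
    execute: callable that takes that state and runs the workload once.
    """
    t0 = time.perf_counter()
    state = setup()
    setup_s = time.perf_counter() - t0

    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        execute(state)
        samples.append(time.perf_counter() - t0)

    return {
        "setup_s": setup_s,
        "execute_median_s": statistics.median(samples),
        # Per-run cost for a caller who pays setup once, then runs `repeats` times.
        "amortized_s": (setup_s + sum(samples)) / repeats,
    }


# Placeholder workload in pure Python so the sketch runs anywhere.
def build_table():
    return [i * i for i in range(50_000)]


def sum_table(table):
    return sum(table)


result = time_phases(build_table, sum_table)
for name, seconds in result.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Reporting the median of the execution samples keeps a single slow run from dominating the figure, and the amortized value shows how much of the per-run cost is setup at a given repeat count.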