vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

A portable C++ runtime built on llama.cpp for serving flow-matching and diffusion VLA policies from compact model bundles across workstation and embedded robot hardware.

Anonymous Authors


Code, converters, model bundles, and benchmark scaffolds will be linked here after release.

ggml-class VLA runtime: To our knowledge, the first native flow-matching and diffusion action-head server on llama.cpp.
One binary, seven architectures: Shared request/response protocol and bundle format across VLA families.
Edge deployment focus: Measured from RTX 3060 down to Jetson modules with an 8 GB unified-memory floor.

Abstract

Vision-language-action policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a poor match for the embedded hardware that robots actually carry. vla.cpp is a portable C++ inference runtime that serves the VLA pattern in which a cached vision-language prefix is consumed by a cross-attending action expert over several solver steps.

One runtime serves seven architectures spanning six backbone families and five action-head implementations behind one request/response protocol. On LIBERO-Object, the engine matches reference behavior within one episode out of 200, runs BitVLA at 100% success in 1.3 GiB of memory, and carries the same bundle from an RTX 3060 down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound; an IMMA ladder GEMM derived from that analysis cuts BitVLA per-step latency by 4.6x on RTX 3060 and 4.0x on AGX Orin.

Keywords: VLA inference, llama.cpp, GGUF bundles, LIBERO-Object, ALOHA arm, edge deployment, roofline analysis

System Architecture

vla.cpp runtime architecture showing client observations, a stateless server, a GGUF bundle, and action chunks.
Unified C++ serving path for flow-matching and diffusion VLA action heads.

Highlights

7 VLA architectures
6 backbone families
5 action-head implementations
1.3 GiB BitVLA runtime footprint
7.95x SmolVLA step-latency speedup
87.5% ALOHA closed-loop success

Key Contributions

01

Unified native runtime

A llama.cpp-based C++ server loads one self-contained bundle format and serves flow-matching or diffusion VLA action heads through a shared request/response path.

GGUF metadata, ZeroMQ serving, Protobuf actions
02

Deployment-scale benchmarking

LIBERO-Object, memory-tier, latency, and roofline measurements show how the same runtime behaves from RTX 3060 workstations down to Jetson-class robot hardware.

7 architectures, 6 backbones, 5 action-head implementations
03

Real-arm evidence path

The ALOHA section pairs PyTorch and vla.cpp trials by task and trial number, with setup photos, aggregate success rates, latency snapshots, and upload slots for the full evidence set.

40 trials per engine, task-level SR, paired videos

Method

vla.cpp runs as a stateless C++ server paired with a lightweight client. The server loads a self-contained GGUF bundle once, decodes observations from a Protobuf request, runs VLA inference, denormalizes actions, and returns an action chunk over ZeroMQ.

Unified control flow

Flow-matching and diffusion action heads map onto one VLA inference path instead of separate per-model serving stacks.

Reusable prefix cache

The vision-language prefix is encoded once per observation and reused by cross-attention across solver steps.

Bundle-local metadata

Model weights, architecture configuration, and action/state normalization statistics travel with the runtime artifact.
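
The three properties above can be sketched together in a few lines. This is a minimal illustration, not the vla.cpp implementation: the `Bundle` class, its field names, the explicit Euler solver, the 0-to-1 time direction, and the mean/std denormalization are all assumptions made for the example, and the toy encoder and velocity field stand in for real model weights. A diffusion head would replace the update rule inside the same loop.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Bundle:
    """Hypothetical stand-in for a loaded bundle; callables replace real weights."""
    encode_prefix: Callable[[Sequence[float]], Vector]   # vision-language prefix encoder
    velocity: Callable[[Vector, float, Vector], Vector]  # action expert v(x, t, prefix)
    action_mean: Vector                                  # normalization stats stored in the bundle
    action_std: Vector
    solver_steps: int = 10


def infer_actions(bundle: Bundle, observation: Sequence[float], noise: Vector) -> Vector:
    """Stateless handler: everything needed arrives through the arguments."""
    prefix = bundle.encode_prefix(observation)  # encoded once per observation
    x = list(noise)
    dt = 1.0 / bundle.solver_steps
    for step in range(bundle.solver_steps):
        t = step * dt
        v = bundle.velocity(x, t, prefix)       # every solver step reuses the cached prefix
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # explicit Euler update (assumed solver)
    # Denormalize with the statistics carried alongside the weights (assumed mean/std scheme).
    return [xi * s + m for xi, s, m in zip(x, bundle.action_std, bundle.action_mean)]


# Toy usage: count encoder calls to show the prefix is computed only once.
calls = {"encode": 0}


def toy_encode(obs: Sequence[float]) -> Vector:
    calls["encode"] += 1
    return [sum(obs)]


def toy_velocity(x: Vector, t: float, prefix: Vector) -> Vector:
    # Constant field along the prefix; a real action expert would cross-attend to it.
    return [prefix[0] for _ in x]


bundle = Bundle(toy_encode, toy_velocity,
                action_mean=[0.5, -0.5], action_std=[2.0, 2.0], solver_steps=4)
actions = infer_actions(bundle, observation=[0.25, 0.75], noise=[0.0, 1.0])
print(actions, calls["encode"])
```

Because the handler keeps no state between calls, the same function can back any transport; the ZeroMQ and Protobuf layers described above would only wrap its inputs and outputs.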

Leaderboard

The main success leaderboards are split by benchmark: LIBERO-Object simulation reports one aggregate success rate over 200 episodes per model, while the real ALOHA arm reports task-level and overall success over the available trial set for each engine. Latency, roofline, memory, and video evidence are supporting deployment measurements.


LIBERO-Object Simulator Leaderboard

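| Rank | Model | Backbone | Chunk | SR (200 eps) | Step | Inference | VRAM |
|---|---|---|---|---|---|---|---|
| 1 | BitVLA | BitNet-SigLIP | 8 | 100.0% | 37.85 ms | 235.9 ms | 1312 MiB |
| 2 | GR00T-N1.7 | Cosmos | 16 | 98.0% | 10.26 ms | 84.1 ms | 6302 MiB |
| 3 | GR00T-N1.6 | Eagle | 16 | 96.5% | 10.29 ms | 83.6 ms | 6048 MiB |
| 4 | GR00T-N1.5 | Eagle | 16 | 96.0% | 14.17 ms | 147.0 ms | 4866 MiB |
| 5 | Evo-1 | InternVL3 | 8 | 94.5% | 63.60 ms | 131.0 ms | 1564 MiB |
| 6 | SmolVLA | SmolVLM2 | 4 | 90.5% | 28.16 ms | 54.8 ms | 1410 MiB |
| 7 | pi0 | PaliGemma | 32 | 87.5% | 9.74 ms | 207.2 ms | 5548 MiB |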

This table has one SR column because the current paper reports the aggregate LIBERO-Object suite result: 10 tasks x 20 episodes = 200 episodes per architecture.
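
As a quick check on how the aggregate SR column maps to episode counts, the arithmetic can be sketched as below. The helper names are made up for this example; the task and episode counts and the three success rates are taken from the text and table above. With 200 episodes, a single episode moves the reported rate by 0.5 percentage points.

```python
TASKS = 10
EPISODES_PER_TASK = 20
TOTAL_EPISODES = TASKS * EPISODES_PER_TASK  # 200 episodes per architecture


def successes_from_rate(sr_percent: float, episodes: int = TOTAL_EPISODES) -> int:
    """Convert an aggregate success rate back into a count of successful episodes."""
    return round(sr_percent * episodes / 100)


def rate_resolution(episodes: int = TOTAL_EPISODES) -> float:
    """Percentage points contributed by a single episode."""
    return 100 / episodes


for model, sr in [("BitVLA", 100.0), ("GR00T-N1.7", 98.0), ("pi0", 87.5)]:
    print(f"{model}: {successes_from_rate(sr)}/{TOTAL_EPISODES}")

print(f"one episode = {rate_resolution()} percentage points")
```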

LIBERO Rollouts

Example simulator rollouts produced by the vla.cpp runtime. These sit with the LIBERO leaderboard because they illustrate simulator behavior rather than real ALOHA-arm execution.

LIBERO rollout with carrot and bowl.

LIBERO rollout at a sink scene.

Real ALOHA Arm Leaderboard

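| Rank | Engine | Policy | Average SR | task 1 | task 2 | Task time (task 1 / task 2) | Inference/chunk |
|---|---|---|---|---|---|---|---|
| 1 | vla.cpp | GR00T-N1.6 BF16 | 35/40 (87.5%) | 18/20 (90%) | 17/20 (85%) | 49 s / 28 s | ~470 ms |
| 2 | pytorch baseline | GR00T-N1.6 BF16 | 16/40 (40.0%) | 3/20 (15%) | 13/20 (65%) | 71 s / 30 s | ~584-688 ms |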

Pass rate is taken from the source worksheet; average time is calculated from successful executions only. PyTorch and vla.cpp each have 20 trials per task.
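
The aggregation described in this note can be sketched as follows. The per-task success counts come from the leaderboard above; the function names and the three-trial timing log are hypothetical and only illustrate how failed trials are excluded from the average time.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# (successes, trials) per task, as reported in the leaderboard.
results: Dict[str, Dict[str, Tuple[int, int]]] = {
    "vla.cpp": {"task 1": (18, 20), "task 2": (17, 20)},
    "pytorch": {"task 1": (3, 20), "task 2": (13, 20)},
}


def overall(engine: str) -> Tuple[int, int, float]:
    """Pool successes and trials across tasks, then compute the percentage."""
    successes = sum(s for s, _ in results[engine].values())
    trials = sum(n for _, n in results[engine].values())
    return successes, trials, 100 * successes / trials


def mean_successful_time(log: List[Tuple[bool, float]]) -> Optional[float]:
    """Average duration in seconds over successful trials only."""
    times = [seconds for succeeded, seconds in log if succeeded]
    return mean(times) if times else None


print(overall("vla.cpp"))
print(overall("pytorch"))

# Hypothetical timing log: the failed 120 s trial does not affect the average.
example_log = [(True, 40.0), (False, 120.0), (True, 60.0)]
print(mean_successful_time(example_log))
```

Averaging over successes only keeps timeouts from inflating task time, but it also means the time column should be read alongside the pass rate rather than on its own.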

Real ALOHA Left Arm Benchmark

GR00T N1.6 BF16 inference comparison on two physical manipulation tasks. The visualization separates success rate, successful-trial time, server latency, and uploaded trial evidence.

task 1: Pick up all blocks on the table and place them into the white box.
Setup photo: ALOHA setup with colored blocks and a white box on the table.

task 2: Pick up the trash and place it into the box, then pick up the banana and place it on the white dish.
Setup photo: ALOHA setup with trash, banana, box, and white dish on the table.

pytorch average success rate 40.0%

16/40 successful trials across task 1 and task 2.

vla.cpp average success rate 87.5%

35/40 successful trials across task 1 and task 2.

best combined pass rate 87.5%

vla.cpp-BF16, 35/40 trials

best task 1 90%

vla.cpp-BF16, 18/20 blocks-to-box trials

best task 2 85%

vla.cpp-BF16, 17/20 trash-and-banana trials

fastest task 2 avg. 0m28s

vla.cpp-BF16, successful trials only

pass rate by task

pytorch task 1: 15%
vla.cpp task 1: 90%
pytorch task 2: 65%
vla.cpp task 2: 85%

average successful time

pytorch task 1: 1m11s
vla.cpp task 1: 0m49s
pytorch task 2: 0m30s
vla.cpp task 2: 0m28s

server latency

pytorch baseline: 584-688 ms
vla.cpp-BF16: 451-507 ms

ALOHA Experiment Videos

Each page shows one trial: one setup image plus videos from the external overview camera, the left wrist camera, and the high camera for PyTorch and vla.cpp.


PyTorch Reference Comparison

On SmolVLA BF16 over LIBERO-Object, vla.cpp reduces environment-observed step latency from 223.96 ms to 28.16 ms while keeping peak VRAM essentially unchanged.

pytorch: 223.96 ms, 1406 MiB
vla.cpp: 28.16 ms, 1410 MiB
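
The headline speedup follows directly from these two measurements. The short sketch below reproduces the arithmetic; the `speedup` helper is a name introduced here for illustration.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


smolvla_speedup = speedup(223.96, 28.16)  # pytorch vs vla.cpp step latency
vram_delta_mib = 1410 - 1406              # vla.cpp minus pytorch peak VRAM

print(f"SmolVLA step-latency speedup: {smolvla_speedup:.2f}x")
print(f"Peak VRAM difference: {vram_delta_mib} MiB")
```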

IMMA Tensor-Core Kernel

Moving BitVLA ternary matrix multiplication from a DP4A path to IMMA tensor cores cuts per-step latency by 4.6x on RTX 3060 and 4.0x on AGX Orin while preserving numerical output.

RTX 3060: 172.8 ms to 37.85 ms
AGX Orin: 406.6 ms to 101.11 ms
Roofline plots placing single-request VLA inference in the compute-bound region on RTX 3060 and Jetson AGX Orin.
The roofline analysis places batch-1 VLA inference in the compute-bound regime, making utilization the deployment lever for latency after footprint is controlled.
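
For readers unfamiliar with the roofline model, the classification works as sketched below. The peak throughput, bandwidth, and arithmetic-intensity values are illustrative placeholders, not measured specifications of the RTX 3060 or any Jetson module, and the function names are introduced only for this example.

```python
def ridge_point(peak_ops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (ops/byte) at which the compute roof meets the memory roof."""
    return peak_ops_per_s / bandwidth_bytes_per_s


def attainable(peak_ops_per_s: float, bandwidth_bytes_per_s: float, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth-limited roof."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)


def regime(peak_ops_per_s: float, bandwidth_bytes_per_s: float, intensity: float) -> str:
    """Label a kernel by which roof limits it."""
    ridge = ridge_point(peak_ops_per_s, bandwidth_bytes_per_s)
    return "compute-bound" if intensity >= ridge else "memory-bound"


def utilization(achieved_ops_per_s: float, peak_ops_per_s: float,
                bandwidth_bytes_per_s: float, intensity: float) -> float:
    """Fraction of the attainable roof that a kernel actually reaches."""
    return achieved_ops_per_s / attainable(peak_ops_per_s, bandwidth_bytes_per_s, intensity)


# Placeholder device: 100 Tops/s peak, 400 GB/s bandwidth (illustrative only).
PEAK = 100 * 10**12
BANDWIDTH = 400 * 10**9

print(ridge_point(PEAK, BANDWIDTH))          # ops per byte at the ridge
print(regime(PEAK, BANDWIDTH, 500))          # above the ridge
print(regime(PEAK, BANDWIDTH, 50))           # below the ridge
print(utilization(25 * 10**12, PEAK, BANDWIDTH, 500))
```

In the compute-bound regime the attainable roof is the peak itself, so a kernel running well below it has headroom that only better use of the compute units can recover; this is the sense in which utilization becomes the latency lever.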

Hardware Tiers

| Model | RTX 3060 | AGX Orin | Orin Nano | Nano RSS |
|---|---|---|---|---|
| SmolVLA | 28.16 ms | 65.41 ms | 141.81 ms | 2031 MiB |
| BitVLA | 37.85 ms | 101.11 ms | 355.65 ms | 2199 MiB |
| Evo-1 | 63.60 ms | 131.01 ms | 458.84 ms | 2135 MiB |
| GR00T-N1.5 | 14.17 ms | 28.78 ms | 84.76 ms* | 5975 MiB |
| pi0 | 9.74 ms | 27.90 ms | 39.10 ms* | 6068 MiB |
| GR00T-N1.6 | 10.29 ms | 26.70 ms | Does not fit | - |
| GR00T-N1.7 | 10.26 ms | 26.84 ms | Does not fit | - |

* Measured in a split run, with the simulator offloaded to another machine.

Memory Efficiency

The smallest robot target is an 8 GB Jetson Orin Nano, where the model must share unified memory with the OS, camera stack, simulator or robot client, and the rest of the autonomy pipeline.

BitVLA bundle 1.34 GiB

Packed ternary weights reduce the on-disk model from 5.6 GiB while keeping the same numerical values at inference.

Load high-water mark 1.15 GiB

The packed layout drops load-time memory pressure enough for BitVLA to run on the Nano-class target.

Nano coverage 5 / 7

Five architectures fit the 8 GB tier; the largest GR00T variants still exceed the shared-memory budget.
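
To illustrate why packing ternary weights can shrink a model without changing its values, here is a minimal lossless packing sketch. The 2-bit code assignment and byte layout are assumptions chosen for clarity; the actual packed layout used by the BitVLA bundle is not described here and may differ.

```python
from typing import List


def pack_ternary(weights: List[int]) -> bytes:
    """Pack values from {-1, 0, 1} into bytes, four 2-bit codes per byte."""
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            byte |= (w + 1) << (2 * j)  # map -1, 0, 1 to codes 0, 1, 2
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` ternary values from the packed bytes."""
    out: List[int] = []
    for byte in packed:
        for j in range(4):
            out.append(((byte >> (2 * j)) & 0b11) - 1)
    return out[:count]  # drop padding slots in the final byte


weights = [1, 0, -1, 1, -1, 0]
packed = pack_ternary(weights)
restored = unpack_ternary(packed, len(weights))
print(len(weights), "weights ->", len(packed), "bytes; lossless:", restored == weights)
```

The round trip is exact, which is the property the section relies on: the footprint changes while the numerical values seen at inference stay the same.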

Comparison

The project targets runtime portability and utilization. Quantization, vendor compilers, and Python reference servers solve adjacent parts of the deployment problem.

Python reference servers

Reference stacks preserve research flexibility, but eager PyTorch dispatch leaves batch-1 VLA inference launch-bound. On SmolVLA, vla.cpp reduces step latency from 223.96 ms to 28.16 ms on the same RTX 3060 setup.

Quantization-only deployment

Weight packing controls footprint, but single-request VLA inference remains compute-bound. The BitVLA speedup comes from moving the same ternary math onto IMMA tensor cores, not from changing the packed format alone.

Single-target compilers

Vendor compilers can be strong for one model on one device generation. vla.cpp instead emphasizes one bundle protocol and one serving path across several VLA architectures and hardware tiers.

Citation

@article{anonymous2026vlacpp,
  title={vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models},
  author={Anonymous},
  year={2026}
}