Glossary of llm-speed terms

Every term llm-speed uses when reporting a benchmark, defined in one sentence first and elaborated below. Cross-reference /methodology for workload specs and /privacy for data-handling terms.

Metrics & timing

tok/s (tokens per second)
tok/s = how fast text streams onto the screen. 10 tok/s reads slower than you. 60 tok/s matches your reading speed. 200+ tok/s is faster than you can follow.
Technically, tok/s (tokens per second) is the rate at which a language model produces or consumes tokens during a benchmark. On llm-speed, tok/s without further qualification refers to decode tok/s — the streaming-output rate users care about during chat.
decode tok/s
decode tok/s is the rate, in tokens per second, at which a model emits new tokens during generation; it is memory-bandwidth-bound and runs one token at a time per stream.
llm-speed measures decode tok/s as output_tokens divided by decode wall-clock time, at batch size 1 unless the workload (concurrent-decode) specifies otherwise.
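As a worked sketch of that arithmetic (the field names are illustrative, not the exact llm-speed result schema):

    def decode_tok_per_s(output_tokens: int, decode_wall_s: float) -> float:
        """Decode throughput: tokens emitted divided by decode wall-clock seconds."""
        return output_tokens / decode_wall_s

    # a chat-short run that emits its 256-token completion in 2.1 s of decode time
    print(round(decode_tok_per_s(256, 2.1), 1))   # 121.9 tok/s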
prefill tok/s
prefill tok/s is the rate, in tokens per second, at which a model ingests the prompt before generation begins; it is compute-bound and parallelizes across the prompt length.
Prefill tok/s is typically 10x to 100x higher than decode tok/s on the same hardware. Even a run with a long prompt and high prefill throughput shows low end-to-end tok/s if decode is the bottleneck.
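A hypothetical worked example of that bottleneck, with round numbers rather than measured results:

    prompt_tokens, prefill_rate = 4096, 2000.0   # chat-long-sized prompt, compute-bound phase
    output_tokens, decode_rate  = 1024, 30.0     # completion, bandwidth-bound phase

    prefill_s = prompt_tokens / prefill_rate                  # ~2.0 s
    decode_s  = output_tokens / decode_rate                   # ~34.1 s
    print(round(output_tokens / (prefill_s + decode_s), 1))   # ~28.3 tok/s end-to-end, set by decode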
TTFT (time-to-first-token)
TTFT, or time-to-first-token, is the wall-clock duration in milliseconds between submitting a prompt and receiving the first generated token.
On local backends, TTFT is dominated by prefill. On hosted APIs it adds queueing and network round-trip time.
p50
p50, the 50th percentile, is the median per-token decode latency in milliseconds; half of decoded tokens take less than this and half take longer.
p95
p95, the 95th percentile, is the per-token decode latency in milliseconds below which 95% of decoded tokens fall; it captures tail latency that the median hides.
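Both percentiles can be recomputed from raw per-token latencies; a minimal nearest-rank sketch (the interpolation method llm-speed itself uses is not specified here):

    import math

    def percentile(latencies_ms, pct):
        """Nearest-rank percentile over per-token decode latencies in milliseconds."""
        ordered = sorted(latencies_ms)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    samples = [12.1, 11.8, 12.3, 55.0, 12.0, 12.2, 11.9, 12.4, 12.1, 48.7]
    print(percentile(samples, 50))   # 12.1 ms, the typical token
    print(percentile(samples, 95))   # 55.0 ms, the tail latency the median hides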
wall_ms
wall_ms is the end-to-end wall-clock duration of a workload in milliseconds, from request start to last token received.
agent-trace TTFT
agent-trace TTFT is the per-turn time-to-first-token captured during the agent-trace workload (W5); it is reported alongside per-turn decode tok/s, end-to-end wall time, and prefix-cache hit rate, so a backend whose first turn is fast but later turns regress (because the prefix cache is not hit) is visible from a single row.

Workloads & internals

suite-v1
suite-v1 is the current versioned llm-speed workload suite, comprising chat-short, chat-long, long-context-decay, concurrent-decode, and agent-trace; the suite_version field on every result identifies which protocol produced it.
Future suite versions (suite-v2, etc.) will not invalidate older results; runs stay accessible and comparable within their own version.
chat-short
chat-short (W1) is the baseline llm-speed workload: a 128-token natural-language prompt and a 256-token completion at batch size 1.
chat-long
chat-long (W2) is the long-prompt llm-speed workload: a 4,096-token prompt assembled from fixture paragraphs and a 1,024-token completion at batch size 1, designed to surface prefill scaling.
long-context-decay
long-context-decay (W3) is the opt-in llm-speed workload that re-runs the same model at 32k, 64k, and 128k input context (where supported), each producing a 256-token completion, to capture how prefill and decode degrade with context length.
concurrent-decode
concurrent-decode (W4) is the llm-speed workload that runs multiple simultaneous decode streams at batch sizes 1, 4, 8, and 16, reporting aggregate throughput, per-stream throughput, and p50/p95 latency at each batch.
Backends without true concurrent decoding (for example MLX, which runs decode streams serially in-process) are flagged on the result so the number is not misread as parallel throughput.
agent-trace
agent-trace (W5) is the llm-speed workload that replays a fixed multi-turn coding-agent trace with realistic tool calls (read-file, write-file, run-tests, fix), growing the context to about 16k tokens by the final turn, to measure inference speed under the workload that matters for daily-driver agentic LLM use.
batch size
batch size is the number of independent decode streams a backend runs in parallel; in the chat-short, chat-long, long-context-decay, and agent-trace workloads it is 1, while concurrent-decode sweeps batch 1, 4, 8, and 16.
context length
context length is the number of tokens of input the model attends to in a single forward pass; the long-context-decay workload measures how prefill and decode degrade at 32k, 64k, and 128k context.
prefix cache
prefix cache, also known as prompt cache, is a backend feature that stores the KV-cache state for a prompt prefix so subsequent requests sharing that prefix skip prefill on the cached portion.
llm-speed records prefix_cache_hit_rate when the backend exposes it, so agent-trace numbers are interpretable on backends that cache between turns and on backends that do not.
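As an illustration only (llm-speed reads the backend's own counter when one is exposed; this is not its formula), a per-turn hit rate can be pictured as the share of the prompt served from cache:

    def prefix_cache_hit_rate(prompt_tokens: int, cached_prefix_tokens: int) -> float:
        """Fraction of this turn's prompt that skipped prefill because its prefix was already cached."""
        return cached_prefix_tokens / prompt_tokens if prompt_tokens else 0.0

    # turn 3 of an agent trace: a 9,000-token prompt whose first 8,200 tokens are unchanged since turn 2
    print(round(prefix_cache_hit_rate(9000, 8200), 2))   # 0.91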
prefill
prefill is the phase of LLM inference in which the model processes the input prompt to populate its KV cache before any output tokens are generated.
decode
decode is the phase of LLM inference in which the model autoregressively produces output tokens, one token per forward pass, after prefill has completed.
KV cache (key/value cache)
the KV cache, short for key/value cache, is the per-layer attention state a transformer accumulates during prefill so that decode can produce each new token without recomputing attention over the full prompt.
KV cache size grows linearly with context length and is the primary reason decode tok/s falls at long context.
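To make that linear growth concrete, a sizing sketch under assumed 70B-class dimensions (80 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16 cache); the exact figures vary by model and backend:

    def kv_cache_gib(context_tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_el=2):
        """Approximate KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes per element, per token."""
        per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_el
        return context_tokens * per_token_bytes / 2**30

    print(round(kv_cache_gib(1_000), 2))     # ~0.31 GiB per 1k tokens
    print(round(kv_cache_gib(128_000), 1))   # ~39.1 GiB at 128k context, linear in context length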

Quantization & integrity

Q4_K_M
Q4_K_M is a llama.cpp / GGUF quantization scheme that stores most weights at 4 bits with k-quant blocks and selectively keeps important tensors at higher precision; it is a common balance of size and quality for local LLM inference.
Q4_K_M context
Q4_K_M context is shorthand for the working-set size of a Q4_K_M-quantized model plus its KV cache at the configured context length; for a 70B model the weights alone are roughly 38 GB, with KV cache adding 0.4 GB per 1k tokens at standard transformer dimensions.
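Using those two figures as-is, a back-of-the-envelope working-set estimate:

    WEIGHTS_GB   = 38.0   # Q4_K_M 70B weights, per the figure above
    KV_GB_PER_1K = 0.4    # KV cache per 1k tokens of context, per the figure above

    def working_set_gb(context_tokens):
        return WEIGHTS_GB + KV_GB_PER_1K * context_tokens / 1000

    print(working_set_gb(8_000))     # 41.2 GB
    print(working_set_gb(32_000))    # 50.8 GB
    print(working_set_gb(128_000))   # 89.2 GB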
model digest
the model digest is the SHA-256 prefix of the weights file used for a benchmark, recorded so that two 'Llama 3.3 70B Q4_K_M' results from different sources are not blended if one used a different quantization pipeline.
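A minimal sketch of computing such a digest (the 16-character prefix length is borrowed from the fingerprint-hash entry below and is illustrative here, not a statement of llm-speed's exact truncation):

    import hashlib

    def model_digest(weights_path: str, prefix_chars: int = 16) -> str:
        """SHA-256 of the weights file, streamed in 1 MiB chunks, truncated to a short hex prefix."""
        h = hashlib.sha256()
        with open(weights_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()[:prefix_chars]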
EdDSA (Edwards-curve Digital Signature Algorithm)
EdDSA, the Edwards-curve Digital Signature Algorithm, is the digital-signature scheme llm-speed uses (specifically Ed25519) to sign every benchmark upload so the bytes cannot be tampered with after signing.
JWS (JSON Web Signature, RFC 7515)
JWS, JSON Web Signature defined by RFC 7515, is the compact-token format llm-speed wraps each upload in; the public key rides in the protected header so any third party can verify the signature without the server.
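Verification needs nothing from the server. A hedged sketch using the PyJWT library, assuming the public key is carried in the standard RFC 7515 "jwk" header parameter (the exact header layout and payload schema are not specified here):

    import jwt  # PyJWT, installed with its cryptography extra for EdDSA support

    def verify_upload(token: str) -> dict:
        """Verify an Ed25519-signed compact JWS using the public key from its protected header."""
        header = jwt.get_unverified_header(token)
        public_key = jwt.algorithms.OKPAlgorithm.from_jwk(header["jwk"])
        return jwt.decode(token, key=public_key, algorithms=["EdDSA"])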
fingerprint
the llm-speed hardware fingerprint is a SHA-256 hash over bucketed hardware fields (CPU model, core count, RAM rounded to 8 GB, GPU name, VRAM rounded to 8 GB, OS major version) that identifies a hardware class, not a device.
Two physically different machines with the same SoC, RAM bucket, and major OS produce the same fingerprint hash. The hash is omitted entirely in --strict-anon mode.
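A sketch of the bucketing idea (field names, ordering, and serialization are illustrative; the canonical form the real client hashes is not shown here):

    import hashlib, json

    def bucket_gb(gb: float, step: int = 8) -> int:
        """Round a memory size to an 8 GB bucket so near-identical machines collide on purpose."""
        return int(round(gb / step) * step)

    def hardware_fingerprint(cpu: str, cores: int, ram_gb: float, gpu: str, vram_gb: float, os_major: str) -> str:
        fields = [cpu, cores, bucket_gb(ram_gb), gpu, bucket_gb(vram_gb), os_major]
        digest = hashlib.sha256(json.dumps(fields).encode()).hexdigest()
        return digest[:16]   # the 16-hex-character fingerprint hash used for grouping

    # two M3 Ultra boxes reporting 93 GB and 96 GB of RAM land in the same bucket, hence the same hash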
fingerprint hash
the fingerprint hash is the 16-hex-character prefix of the llm-speed hardware fingerprint, used server-side to group runs of the same hardware class for outlier detection.

Backends, hosts & CLI flags

MLX
MLX is Apple's Metal-native array framework for machine learning on Apple Silicon; llm-speed measures it via the mlx-lm package and reports it under the backend label 'mlx'.
llama.cpp
llama.cpp is a C/C++ runtime for transformer inference with first-class Metal, CUDA, ROCm, and CPU backends; llm-speed measures it via the llama-server / llama-cli binary on PATH and reports the build version.
vLLM
vLLM is a high-throughput inference server with PagedAttention and continuous batching, primarily targeting CUDA datacenter accelerators; llm-speed measures it via its OpenAI-compatible HTTP server or the vllm Python package.
ollama
ollama is a daemon that wraps llama.cpp with a model registry and a local REST API on port 11434; llm-speed measures it as a separate backend from llama.cpp because the surrounding harness affects TTFT.
exllamav2
exllamav2 is a CUDA-only inference library specialized for fast decoding of GPTQ/EXL2-quantized weights on consumer NVIDIA GPUs; llm-speed measures it via the exllamav2 Python package.
hosted-api
hosted-api is the llm-speed CLI backend label for OpenAI-compatible HTTP endpoints; it is gated behind an explicit opt-in (LLMSPEED_ENABLE_HOSTED_API=1) and is intended for private benchmarking only — uploads of hosted-API runs are rejected by the public leaderboard, because most provider terms forbid republishing third-party speed benchmarks.
--strict-anon
--strict-anon is the llm-speed CLI mode in which a fresh Ed25519 keypair is generated in memory for every run, the fingerprint hash is omitted from the payload, and no User-Agent or X-LLM-Speed-Anon header is sent, so consecutive submissions from the same machine are unlinkable from the server's perspective.
--anon
--anon is the llm-speed CLI mode that sends an X-LLM-Speed-Anon: 1 header instructing the server not to associate the upload with any account, while keeping the persistent keypair and fingerprint hash for outlier-cluster grouping.
--no-upload
--no-upload is the llm-speed CLI mode that runs the workloads, saves the signed result to ~/.cache/llm-speed/runs/, and makes no network call at all; the result can be re-uploaded later with 'llm-speed bench --resume <path>'.
outlier flag
an outlier flag is a server-side marker on a benchmark result whose decode tok/s falls more than 3 sigma from the mean of its (model x hardware x backend) cluster; flagged runs remain visible and may be challenged in a public dispute thread.
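A minimal sketch of that rule (illustrative only; the server's actual clustering and any robustness tweaks are not shown):

    import statistics

    def is_outlier(decode_tok_s: float, cluster: list, sigmas: float = 3.0) -> bool:
        """Flag a run whose decode tok/s sits more than `sigmas` standard deviations from its cluster mean."""
        if len(cluster) < 2:
            return False   # not enough peers in the (model x hardware x backend) cluster to judge
        mean = statistics.fmean(cluster)
        sd = statistics.stdev(cluster)
        return sd > 0 and abs(decode_tok_s - mean) > sigmas * sd

    print(is_outlier(61.0, [18.9, 19.4, 20.1, 19.0, 19.7]))   # True: flagged, but the run stays visible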
run badge
the run badge is an embeddable SVG image at llm-speed.com/badge/<run_id>.svg that renders a single run's headline decode tokens-per-second, model, and hardware in one image; intended for README files, blog posts, and hardware-review embeds. Added 2026-04-29.
Markdown embed: [![llm-speed](https://llm-speed.com/badge/r_y2_5y8oo97d.svg)](https://llm-speed.com/r/r_y2_5y8oo97d). The SVG is served with cache-control max-age=14400 and stale-while-revalidate, so the embed reflects the canonical run page without re-fetching on every view. Click-through resolves to the run permalink, which is the verifiable source of the number.

Models & hardware

Qwen3-Coder-Next
Qwen3-Coder-Next is Alibaba Cloud's coding-tuned large language model in the Qwen3 family, succeeding Qwen2.5-Coder; on llm-speed, Qwen3-Coder-Next runs are aggregated at llm-speed.com/m/qwen3-coder-next with hardware-side decode tok/s, prefill tok/s, and TTFT measured under suite-v1.
For Qwen3-Coder-Next coding-quality scores (HumanEval, MBPP, SWE-Bench Verified, LiveCodeBench), see the Qwen team's GitHub at github.com/QwenLM and the model card on Hugging Face. llm-speed publishes wall-clock speed only.
Llama 3.3 70B
Llama 3.3 70B is Meta's 70-billion-parameter instruction-tuned language model in the Llama 3 family; llm-speed aggregates Llama 3.3 70B runs at llm-speed.com/m/llama-3-3-70b-instruct, sorted by decode tokens per second across backends and hardware.
At Q4_K_M quantization the model is roughly 38 GB on disk, which exceeds the RTX 5090's 32 GB of VRAM, so a single 5090 needs more aggressive quantization or partial offload; it runs on dual RTX 4090s with the weights sharded across both cards, and on Apple Silicon (M3 Ultra, up to 512 GB unified) with substantial headroom.
Qwen2.5-7B-Instruct
Qwen2.5-7B-Instruct is Alibaba Cloud's 7-billion-parameter instruction-tuned model in the Qwen2.5 family; llm-speed aggregates Qwen2.5-7B-Instruct runs at llm-speed.com/m/qwen2-5-7b-instruct.
At Q4_K_M the model is roughly 4 GB on disk, fits comfortably on Apple Silicon (M-series Pro / Max / Ultra) and any consumer GPU with 8 GB or more of VRAM, and is a common drop-in for local coding assistants alongside Qwen2.5-Coder-7B.
Llama-4-Scout
Llama-4-Scout is Meta's mixture-of-experts model in the Llama 4 family, roughly 109B total parameters with 17B active per token; on llm-speed, local-hardware Llama-4-Scout runs are aggregated at llm-speed.com/m/llama-4-scout.
Llama-4-Scout fits on M3 Ultra at 4-bit but overflows RTX 5090 32 GB VRAM at full context. To produce a measured number, run "llm-speed bench --models meta-llama/Llama-4-Scout --backends mlx,llama.cpp" on M3 Ultra.
Llama-4-Maverick
Llama-4-Maverick is Meta's larger mixture-of-experts model in the Llama 4 family; on llm-speed, local-hardware Llama-4-Maverick runs are aggregated at llm-speed.com/m/llama-4-maverick.
stable-code-instruct-3b
stable-code-instruct-3b is Stability AI's 3-billion-parameter coding-instruct model; aggregated on llm-speed at llm-speed.com/m/stable-code-instruct-3b.
Currently the fastest local model on M3 Ultra: 192.5 decode tok/s, prefill 560.7 tok/s, TTFT 226.5 ms via mlx 0.31.3 (run r_y2_5y8oo97d). The same model on M3 Pro 36 GB runs at 19.4 decode tok/s (run r_pqjsvd-cub4), giving a clean ~10x scaling factor between the two Apple Silicon SKUs on identical bytes.
Qwen3-Coder-30B-A3B-Instruct
Qwen3-Coder-30B-A3B-Instruct is Alibaba Cloud's mixture-of-experts coding-tuned model in the Qwen3 family, with roughly 30B total parameters and 3B active per token; aggregated on llm-speed at llm-speed.com/m/qwen3-coder-30b-a3b-instruct.
Currently the fastest 30B-class local coding model on M3 Ultra: 112.2 decode tok/s, prefill 204.0 tok/s, TTFT 539.3 ms via mlx 0.31.3 (run r_fpsca03u2o_). The MoE architecture (3B active per token) keeps decode tok/s much higher than a dense 30B at the same hardware budget.
gpt-oss-20b
gpt-oss-20b is OpenAI's open-weights 20-billion-parameter model in MXFP4-Q4 quantization; aggregated on llm-speed at llm-speed.com/m/gpt-oss-20b.
Current top local run: 152.7 decode tok/s, prefill 239.9 tok/s, TTFT 692.1 ms on M3 Ultra (60-core GPU) + 96 GB unified via mlx 0.31.3 (run r_3ijun8ltjnb). MXFP4 is a 4-bit microscaled floating-point format; on Apple Silicon the MLX backend handles MXFP4 weights natively.
DeepSeek-Coder-V2-Lite-Instruct
DeepSeek-Coder-V2-Lite-Instruct is DeepSeek's coding-tuned mixture-of-experts model; aggregated on llm-speed at llm-speed.com/m/deepseek-coder-v2-lite-instruct.
Current top run: 168.3 decode tok/s, prefill 291.5 tok/s, TTFT 449.5 ms on M3 Ultra (60-core GPU) + 96 GB unified via mlx 0.31.3 (run r_l_v1-zq_qaz). Second-fastest local model on the M3 Ultra leaderboard as of 2026-04-29.
Qwen2.5-Coder-32B-Instruct-4bit
Qwen2.5-Coder-32B-Instruct-4bit is Alibaba Cloud's 32-billion-parameter dense coding-tuned model at 4-bit; aggregated on llm-speed at llm-speed.com/m/qwen2-5-coder-32b-instruct.
Current top run: 34.5 decode tok/s, prefill 144.1 tok/s, TTFT 909.4 ms on M3 Ultra (60-core GPU) + 96 GB unified via mlx 0.31.3 (run r_721b4bls_oq). At dense 32B parameters, decode tok/s is roughly 3x lower than the same hardware running a 30B-active-3B MoE coder, which is the architectural win MoE buys.
M3 Ultra
M3 Ultra is Apple's highest-end SoC in the M3 family, built by fusing two M3 Max dies; it ships with up to 80 GPU cores, up to 512 GB of unified memory, and roughly 800-900 GB/s of memory bandwidth, giving it enough unified memory to hold a 70B model at full Q4_K_M weights with substantial context headroom.
On llm-speed, M3 Ultra runs are aggregated at llm-speed.com/hw/m3-ultra; both MLX and llama.cpp backends are supported and sorted side-by-side in the cheatsheet.
M3 Pro
M3 Pro is Apple's mid-range SoC in the M3 family, with 14- or 18-core GPU configurations, 18 GB or 36 GB of unified memory, and roughly 150 GB/s of memory bandwidth; it runs 7B-class models comfortably at 4-bit and is the most common machine in the current llm-speed dataset.
On llm-speed, M3 Pro runs are aggregated at llm-speed.com/hw/m3-pro; the 18-core / 36 GB SKU and the 14-core / 18 GB SKU are reported with the same accelerator_summary granularity that the cheatsheet uses for cross-rig comparison.
RTX 5090
RTX 5090 is NVIDIA's flagship consumer GPU in the Blackwell generation, with 32 GB of GDDR7 VRAM on a 512-bit bus delivering roughly 1,792 GB/s of memory bandwidth; on llm-speed, RTX 5090 runs are aggregated at llm-speed.com/hw/rtx-5090.
Decode tokens-per-second on consumer GPUs is dominated by memory bandwidth; RTX 5090's 1,792 GB/s is roughly 1.8x the RTX 4090's 1,008 GB/s, so the same model and quant typically decode 1.5x to 2x faster on a 5090 than on a 4090.
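Because decode is bandwidth-bound, a rough ceiling is memory bandwidth divided by the bytes read per generated token (roughly the quantized model size for a dense model). A sketch with illustrative numbers, not measured llm-speed results:

    def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
        """Upper bound on decode tok/s if every weight byte is read once per generated token."""
        return bandwidth_gb_s / model_gb

    # a ~4 GB Q4_K_M 7B-class model on both flagship consumer GPUs
    print(round(decode_ceiling_tok_s(1792, 4.0)))   # ~448 tok/s ceiling on RTX 5090
    print(round(decode_ceiling_tok_s(1008, 4.0)))   # ~252 tok/s ceiling on RTX 4090
    # measured runs land below these ceilings; the ~1.8x ratio between the cards is the point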
RTX 4090
RTX 4090 is NVIDIA's flagship consumer GPU in the Ada Lovelace generation, with 24 GB of GDDR6X VRAM and roughly 1,008 GB/s of memory bandwidth; on llm-speed, RTX 4090 runs are aggregated at llm-speed.com/hw/rtx-4090.

Cite a specific run as llm-speed.com/r/<id>, a per-model page as llm-speed.com/m/<slug>, and a per-hardware page as llm-speed.com/hw/<slug>.