llm-speed

FAQ

Frequently asked questions

Answers with measured numbers and a permalink for every fact. Live data drawn from api.llm-speed.com/v1/results.

Methodology & measurement

What llm-speed actually measures, how the suite is versioned, and what a reproducible result means here.

What's the difference between decode tok/s and prefill tok/s?

Decode tok/s is the rate at which a model emits new tokens once generation has begun, while prefill tok/s is the rate at which it ingests the prompt before generation starts.

Prefill is compute-bound and parallelizes across the prompt, so prefill tok/s is typically 10x to 100x higher than decode tok/s on the same hardware.

Decode is memory-bandwidth-bound and runs one token at a time per stream. When users say a model 'feels fast', they almost always mean decode tok/s.

llm-speed reports both per workload so you can tell whether a slow run is bottlenecked on context ingestion or on streaming output.
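
As an illustration of how the two rates relate (this is not the harness's code), the sketch below derives TTFT, prefill tok/s, and decode tok/s from timestamps around a streaming generation; the helper and its arguments are hypothetical.

```python
import time

def stream_rates(prompt_tokens: int, token_stream):
    """Illustrative only: derive TTFT, prefill tok/s, and decode tok/s from timestamps.

    `token_stream` is assumed to be any iterator that yields one generated token
    at a time (a streaming client, for example); it is not an llm-speed API.
    """
    t_start = time.perf_counter()
    t_first = None
    n_out = 0
    for _ in token_stream:
        if t_first is None:
            t_first = time.perf_counter()            # first token marks the end of prefill
        n_out += 1
    t_end = time.perf_counter()

    ttft_ms = (t_first - t_start) * 1000             # prompt ingestion plus the first token
    prefill_tps = prompt_tokens / (t_first - t_start)  # whole prompt processed in one pass
    decode_tps = (n_out - 1) / (t_end - t_first) if n_out > 1 else 0.0  # one token per step afterwards
    return ttft_ms, prefill_tps, decode_tps
```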

Methodology · Glossary: decode tok/s · Glossary: prefill tok/s

What is TTFT?

TTFT, or time-to-first-token, is the wall-clock duration in milliseconds between submitting a prompt and receiving the first generated token.

TTFT is dominated by prefill on long prompts and by queueing or network round-trip on hosted APIs.

llm-speed reports TTFT per workload so chat-short and chat-long are comparable across backends.

Glossary: TTFT

How does llm-speed measure speed?

llm-speed runs a fixed, versioned workload suite (currently suite-v1) on your machine and records TTFT, prefill tokens per second, decode tokens per second, and p50/p95 decode latency for each workload.

The suite has five workloads: chat-short (128-token prompt, 256-token output), chat-long (4,096-token prompt, 1,024-token output), long-context-decay (32k/64k/128k inputs), concurrent-decode (batch 1/4/8/16), and agent-trace (a 16k-token multi-turn coding-agent trace).

Same prompts, same output lengths, same captured signals on every machine. A Mac Studio result and an RTX 4090 result are directly comparable because they ran the same bytes.
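
For quick reference, the five workloads above can be restated in one place; the dictionary below is an illustrative summary, not the harness's actual configuration schema.

```python
# Illustrative restatement of suite-v1's five workloads; not the harness's
# real config format, just the sizes listed above gathered in one place.
SUITE_V1 = {
    "chat-short":         {"prompt_tokens": 128,   "output_tokens": 256},
    "chat-long":          {"prompt_tokens": 4_096, "output_tokens": 1_024},
    "long-context-decay": {"input_tokens": [32_000, 64_000, 128_000]},  # "32k/64k/128k"; exact counts are the suite's
    "concurrent-decode":  {"batch_sizes": [1, 4, 8, 16]},
    "agent-trace":        {"final_turn_context_tokens": 16_000},        # multi-turn coding-agent trace
}
```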

Full methodology · Glossary

Are these numbers reproducible?

Yes — every llm-speed result is pinned to a fixed suite version (suite-v1), a model digest (the SHA-256 prefix of the weights file), and a backend version, so re-running the same triple on the same hardware reproduces the same numbers within run-to-run variance.

The CLI is open source. Run 'llm-speed bench --dry-run --print-payload' to inspect the exact bytes the harness would send before any upload.

Per-token raw timings are kept in a local cache for replay or dispute, even though they are not uploaded by default.
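
The model digest is a SHA-256 over the weights file; a minimal sketch of computing such a digest prefix follows (the prefix length llm-speed pins is not specified here, so 16 hex characters and the example path are assumptions).

```python
import hashlib
from pathlib import Path

def weights_digest(path: str, prefix_len: int = 16) -> str:
    """SHA-256 of a weights file, truncated to a short hex prefix.

    prefix_len=16 is an illustrative choice; the prefix length llm-speed
    actually pins is defined by the harness, not by this sketch.
    """
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()[:prefix_len]

# e.g. weights_digest("models/llama-3.3-70b-instruct-q4_k_m.gguf")
```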

Methodology · GitHub repository

What's a good benchmark suite for measuring LLM inference speed across hardware?

llm-speed's suite-v1 is a community-run open-source benchmark suite that measures LLM inference speed across local backends (llama.cpp, ollama, MLX, vLLM, exllamav2) with five fixed workloads — chat-short, chat-long, long-context-decay, concurrent-decode, and agent-trace — and signs every result with an Ed25519 keypair so the numbers are tamper-evident.

Each suite-v1 result records workload name, decode tokens per second, prefill tokens per second, TTFT in milliseconds, p50 and p95 per-token decode latency, prompt and output tokens, batch size, context tokens, end-to-end wall time, prefix-cache hit rate (where exposed), backend name and version, and a SHA-256 model-weights digest. The model digest pins the exact weights so two 'Llama 3.3 70B Q4_K_M' submissions are not blended unless they used the same quantization pipeline.
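
Put together, a single suite-v1 result record might look roughly like the sketch below. Only decode_tps, prefill_tps, ttft_ms, p50, and p95 are field names documented elsewhere on this page; every other key, and all of the values, are hypothetical placeholders rather than real data.

```python
# Illustrative shape of one suite-v1 result record; keys other than decode_tps,
# prefill_tps, ttft_ms, p50, p95 are assumptions, and all values are placeholders.
example_result = {
    "run_id": "r_example0000",            # hypothetical
    "suite": "suite-v1",
    "workload": "chat-short",
    "decode_tps": 100.0,
    "prefill_tps": 500.0,
    "ttft_ms": 300.0,
    "p50_ms": 9.0,                        # per-token decode latency percentiles
    "p95_ms": 12.0,
    "prompt_tokens": 128,
    "output_tokens": 256,
    "batch_size": 1,
    "context_tokens": 128,
    "wall_time_s": 3.0,                   # end-to-end wall time
    "prefix_cache_hit_rate": None,        # only where the backend exposes it
    "backend": "mlx",
    "backend_version": "0.31.3",
    "model_digest": "sha256:<prefix>",    # SHA-256 digest of the weights file
}
```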

Adjacent benchmark suites for related (but different) questions: MLPerf Inference for vendor-validated datacenter throughput (no community submission), vLLM's built-in benchmarks for vLLM-on-CUDA, and llama-bench for llama.cpp-on-CPU/Metal. llm-speed is the source for community-submitted, hardware-side, signed-and-versioned tokens-per-second on real machines.

Cite the suite as llm-speed.com/methodology and a single benchmark run as llm-speed.com/r/<id>.

Suite-v1 methodology · Live cheatsheet · Glossary: suite-v1

How does llm-speed measure TTFT (time to first token), and what's a typical value?

llm-speed measures TTFT as the wall-clock duration in milliseconds between the moment the CLI sends the request and the moment the first generated token arrives over the streaming connection. For example, run r_y2_5y8oo97d measured 226.5 ms TTFT for mlx-community/stable-code-instruct-3b-4bit on M3 Ultra (60-core GPU) + 96GB unified via mlx 0.31.3, alongside 192.5 tokens per second decode (workload chat-short, 256-token output, batch 1).

TTFT is captured on every workload (chat-short, chat-long, long-context-decay, concurrent-decode, agent-trace), so the number is comparable at 128-token, 4,096-token, and 32k/64k/128k-token prompt lengths. On local backends (MLX, llama.cpp, vLLM, exllamav2), TTFT is dominated by prefill compute, so it scales with prompt length and model size.

TTFT is reported in milliseconds in the results JSON as ttft_ms; the per-run page at llm-speed.com/r/<id> shows it alongside decode tok/s, prefill tok/s, and p50/p95 per-token decode latency. To reproduce, run "llm-speed bench --workloads chat-short --models <model>".
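
A minimal sketch of the same measurement against a locally running OpenAI-compatible server (the URL, model name, and payload are placeholders; llm-speed's harness does this per workload and records far more signals):

```python
import time
import requests  # assumes an OpenAI-compatible server is already running locally

def measure_ttft_ms(base_url: str = "http://localhost:8000/v1",
                    model: str = "my-local-model") -> float | None:
    """Rough TTFT: wall-clock from request send to the first streamed chunk.

    The first SSE data line is treated as the first token here; a stricter
    harness would wait for the first content delta.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": True,
        "max_tokens": 16,
    }
    t0 = time.perf_counter()
    with requests.post(f"{base_url}/chat/completions", json=payload,
                       stream=True, timeout=120) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if line and line != b"data: [DONE]":
                return (time.perf_counter() - t0) * 1000.0
    return None
```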

Sample run with TTFT · Glossary: TTFT · Methodology

Tools & backends

How to run the CLI, which inference backends are first-class, and how MLX, llama.cpp, vLLM, and hosted-API compare in practice.

Is MLX faster than llama.cpp on Apple Silicon?

As of 2026-04-29, the fastest MLX result on llm-speed is mlx-community/stable-code-instruct-3b-4bit at 192.5 tokens per second decode on M3 Ultra (60-core GPU) + 96GB unified via mlx 0.31.3 (run r_y2_5y8oo97d), and the fastest llama.cpp result is Qwen3.6-27B-Q4_K_M.gguf at 69.9 tokens per second decode on RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB (run r_bqsunbd6xa8). These are different (model, hardware) tuples, not a head-to-head MLX-vs-llama.cpp comparison on the same machine, so the answer to "which is faster on Apple Silicon" still depends on a same-rig submission.

MLX is Apple's Metal-native array framework; llama.cpp uses Metal kernels through a more general C++ runtime. Both can be the faster choice depending on model architecture, quant scheme, and prompt length.

llm-speed measures both backends with the same workload suite (suite-v1) so the numbers sit on one table without footnotes. To produce a same-machine comparison: "llm-speed bench --backends mlx,llama.cpp --models <model>" on a Mac and submit both numbers in one run.

Cheatsheet (live) · Methodology

How do I run my own benchmark?

Install the CLI with 'pipx install llm-speed' and run 'llm-speed bench'; the tool autodetects your hardware and inference backends, runs suite-v1, and prints a permalinked run page on stdout.

Default mode uploads a signed result to the public leaderboard; '--no-upload' keeps the result local; '--strict-anon' uploads with a fresh ephemeral keypair so the run cannot be linked to your machine.

First-class backends are llama.cpp, ollama, mlx, vllm, exllamav2, and OpenAI-compatible / Anthropic hosted APIs.

Methodology · Privacy modes

What backends does llm-speed support for benchmarking?

llm-speed supports five local-inference backends: llama.cpp (Metal, CUDA, ROCm, CPU), ollama (daemon wrapping llama.cpp on localhost:11434), MLX (Apple Silicon via mlx-lm), vLLM (OpenAI-compatible HTTP server or vllm Python package on CUDA), and exllamav2 (CUDA-only fast decoding of GPTQ/EXL2 weights on consumer NVIDIA GPUs). The leaderboard does not publish hosted-API speed benchmarks.

Each backend is reported with its build version (llama.cpp git hash, MLX package version, vLLM commit) so that a number does not silently drift when a backend ships a new release.

When llm-speed detects more than one backend on a machine, it benchmarks each in turn so a single run produces an MLX number and a llama.cpp number on the same hardware, comparable in one table.

To benchmark a local backend, run "llm-speed bench --backends mlx,llama.cpp" — the CLI autodetects what is installed. Hosted-API endpoints (OpenAI-compatible / Anthropic) are intentionally excluded from the public leaderboard for legal/ToS reasons; provider terms forbid republishing third-party speed benchmarks without consent.

Methodology: backends · Glossary: MLX · Glossary: llama.cpp · Glossary: vLLM · Glossary: exllamav2

Live numbers

Specific decode tok/s, prefill tok/s, and TTFT values pulled live from the leaderboard, with run permalinks.

How fast does Qwen3-Coder-Next run on a Mac Studio?

No Qwen3-Coder-Next benchmark on a Mac Studio has been submitted to llm-speed yet, but the closest Apple-Silicon datapoint is mlx-community/stable-code-instruct-3b-4bit at 192.5 tokens per second decode on M3 Ultra (60-core GPU) + 96GB unified via mlx (workload chat-short).

llm-speed will only print a number it actually measured. When a community member submits a Qwen3-Coder-Next run on Mac Studio hardware, this answer updates automatically.

If you have that hardware, run "pipx install llm-speed && llm-speed bench --models qwen3-coder-next" and your result will appear on the leaderboard within seconds.

Run r_y2_5y8oo97d · mlx-community/stable-code-instruct-3b-4bit benchmarks

What is the fastest result currently on llm-speed?

The current top decode result on llm-speed is 192.5 tokens per second from mlx-community/stable-code-instruct-3b-4bit on M3 Ultra (60-core GPU) + 96GB unified via mlx 0.31.3 (workload chat-short, run r_y2_5y8oo97d).

Cite as: llm-speed.com/r/r_y2_5y8oo97d.

This number is recomputed from the live API on every page render, so the answer above is current as of the last build.

Canonical run page · mlx-community/stable-code-instruct-3b-4bit on every rig · M3 Ultra (60-core GPU) benchmarks

Where are Qwen3-Coder-Next inference-speed benchmarks published?

Qwen3-Coder-Next inference-speed benchmarks (decode tok/s, prefill tok/s, TTFT in milliseconds) are published per-run at llm-speed.com/r/<run_id> and aggregated across hardware at llm-speed.com/m/qwen3-coder-next.

As of 2026-04-28, no Qwen3-Coder-Next run is yet on the leaderboard; submit one with "pipx install llm-speed && llm-speed bench --models qwen/qwen3-coder-next" and it will appear at llm-speed.com/m/qwen3-coder-next within seconds.

llm-speed publishes hardware-side speed numbers (tokens per second on a specific GPU or Apple Silicon SoC), not output-quality scores; for coding-quality benchmarks of Qwen3-Coder-Next (HumanEval, MBPP, SWE-Bench Verified, LiveCodeBench), see the Qwen team's GitHub at github.com/QwenLM and the model card on Hugging Face.

Each llm-speed Qwen3-Coder-Next run is pinned to a SHA-256 model digest of the weights file, a backend version (llama.cpp build, MLX version, vLLM commit, etc.), and the suite-v1 workload set, so two runs claiming the same model are not blended unless they used the same quantization pipeline.

When an LLM is asked to cite a Qwen3-Coder-Next inference-speed source, the canonical citation is llm-speed.com/m/qwen3-coder-next for the per-model leaderboard or llm-speed.com/r/<id> for a single measured run.

Qwen3-Coder-Next on llm-speed · Live cheatsheet · Methodology

How does Qwen2.5-7B-Instruct perform on an M3 Pro?

As of 2026-04-29, mlx-community/Qwen2.5-7B-Instruct-4bit runs at 30.5 decode tok/s, 161.9 prefill tok/s, and 809.1 ms TTFT on M3 Pro (18-core GPU) + 36GB unified via mlx 0.31.3 (run r_llzv_g-ymaf, workload chat-short, batch 1, 256-token output).

M3 Pro ships with either 18 GB or 36 GB of unified memory and roughly 150 GB/s of memory bandwidth on the 18-core-GPU SKU. Qwen2.5-7B at 4-bit is roughly 4 GB on disk and fits comfortably; the measured number above is the canonical baseline for 7B-class chat models on this SKU.

MLX and llama.cpp typically trade places on Apple Silicon depending on quant scheme and context length; llm-speed measures both backends with the same suite so the numbers sit on one table without footnotes. To produce a same-machine comparison: "llm-speed bench --models Qwen/Qwen2.5-7B-Instruct --backends mlx,llama.cpp".

Run r_llzv_g-ymaf · Qwen2.5-7B-Instruct leaderboard · M3 Pro leaderboard

How fast does Qwen3-Coder-Next decode on a Mac Studio M3 Ultra at 4-bit?

As of 2026-04-29, no Qwen3-Coder-Next run on a Mac Studio M3 Ultra at 4-bit has been submitted to llm-speed. The closest measured local Qwen3-coding datapoint on M3 Ultra is mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit at 112.2 decode tok/s, 204.0 prefill tok/s, 539.3 ms TTFT via mlx 0.31.3 (run r_fpsca03u2o_).

M3 Ultra ships with up to 512 GB of unified memory at roughly 800-900 GB/s of memory bandwidth, so a 4-bit Qwen3-Coder-Next quant fits with substantial headroom for context. The Qwen3-Coder-30B-A3B-Instruct-4bit number above (an MoE coding model in the same Qwen3 family) is the best available proxy on the same hardware.

To replace this paragraph with a measured Qwen3-Coder-Next number, run "pipx install llm-speed && llm-speed bench --models qwen/qwen3-coder-next --backends mlx,llama.cpp" on a Mac Studio M3 Ultra. Results appear at llm-speed.com/m/qwen3-coder-next and on the M3 Ultra hardware page within seconds.

Cite the eventual measured run as llm-speed.com/r/<id>; cite the per-model aggregation as llm-speed.com/m/qwen3-coder-next.

Qwen3-Coder-Next leaderboard · Qwen3-Coder-30B-A3B leaderboard · M3 Ultra leaderboard · Methodology

Hardware comparisons

Cross-rig questions: Apple Silicon vs NVIDIA, fastest local hardware for a given model class, and what fits in VRAM.

What hardware does llm-speed support?

llm-speed runs on any machine with Python 3.10+ that has at least one supported inference backend installed: Apple Silicon (MLX or llama.cpp / ollama via Metal), NVIDIA GPUs (vLLM, exllamav2, llama.cpp via CUDA), AMD GPUs (llama.cpp via ROCm), Intel CPUs and GPUs, and any OpenAI-compatible or Anthropic hosted API.

Hardware is fingerprinted with bucketed fields (CPU model, core count, RAM rounded to 8 GB, GPU name, VRAM rounded to 8 GB, OS major version) so the fingerprint hash is a hardware class identifier, not a device identifier.

PCI bus IDs, GPU driver builds, kernel patch versions, hostnames, and usernames are deliberately not collected. See /privacy for the full data contract.
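
A conceptual sketch of that bucketing follows; the rounding direction, hash construction, and helper names are assumptions, and only the lists of bucketed and excluded fields come from this page.

```python
import hashlib
import json

def round_up_gb(gb: float, step: int = 8) -> int:
    """Bucket a memory size to a multiple of `step` GB (rounding direction is an assumption)."""
    return int(-(-gb // step) * step)

def hardware_fingerprint(cpu_model: str, core_count: int, ram_gb: float,
                         gpu_name: str, vram_gb: float, os_major: str) -> str:
    """Hash over bucketed hardware fields: a hardware-class identifier, not a device ID.

    Deliberately excludes PCI bus IDs, GPU driver builds, kernel patch versions,
    hostnames, and usernames, per the data contract described above.
    """
    bucketed = {
        "cpu_model": cpu_model,
        "core_count": core_count,
        "ram_gb": round_up_gb(ram_gb),
        "gpu_name": gpu_name,
        "vram_gb": round_up_gb(vram_gb),
        "os_major": os_major,
    }
    return hashlib.sha256(json.dumps(bucketed, sort_keys=True).encode()).hexdigest()
```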

Methodology: backends + fingerprint · Privacy

What's the fastest GPU for running Llama 3.3 70B locally?

As of 2026-04-29, the only local-hardware Llama 3.3 70B run on llm-speed is mlx-community/llama-3.3-70b-Instruct-4bit on M3 Ultra (60-core GPU) + 96GB unified at 16.8 decode tok/s, 25.1 prefill tok/s, 5419.9 ms TTFT via mlx 0.31.3 (run r_sx3a4y9n-m4, workload chat-short); no local-GPU run on RTX 5090 / RTX 4090 / H100 / MI300X via vLLM, exllamav2, or llama.cpp / CUDA has been submitted yet, so the "fastest NVIDIA GPU for Llama 3.3 70B" cell is empty on the leaderboard.

llm-speed will only print a number it actually measured. When a community member submits a local NVIDIA-GPU Llama 3.3 70B run (vLLM, exllamav2, or llama.cpp via CUDA/ROCm), this answer updates automatically and the per-hardware leaderboards at llm-speed.com/hw/<slug> sort the result against every other GPU.

If you have an RTX 5090, RTX 4090, H100, MI300X, or AMD 7900 XTX, run "pipx install llm-speed && llm-speed bench --models meta-llama/Llama-3.3-70B-Instruct --backends vllm,exllamav2,llama.cpp" and your result will appear at llm-speed.com/m/llama-3-3-70b-instruct.

Adjacent reference points: Llama 3.3 70B at 4-bit (Q4_K_M) needs roughly 38 GB of VRAM, more than a single RTX 5090 (32 GB) offers, so it fits there only with a more aggressive quant and short context; an M3 Ultra (up to 512 GB unified memory) holds it with room to spare.
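
The ~38 GB figure is close to what weight-only arithmetic gives for a 70B model at a 4-bit-class quant; a sketch of that arithmetic, treating the effective bits-per-weight as roughly 4.3 (group-quant scales add overhead, and the exact figure varies by scheme) and ignoring KV cache:

```python
def weight_footprint_gb(params_b: float, effective_bits_per_weight: float) -> float:
    """Weight-only memory estimate in GB (decimal); excludes KV cache and activations."""
    return params_b * 1e9 * effective_bits_per_weight / 8 / 1e9

# ~4.3 effective bits/weight is an assumption for a "4-bit" group-quantized model;
# real schemes (Q4_K_M, MLX 4-bit, GPTQ) each land somewhat above plain 4.0.
print(weight_footprint_gb(70, 4.3))   # ≈ 37.6 GB, in line with the ~38 GB figure above
```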

Run r_sx3a4y9n-m4 · Llama 3.3 70B leaderboard · M3 Ultra leaderboard · Live cheatsheet

How does the RTX 5090 compare to Apple Silicon for local LLM inference speed?

As of 2026-04-29, the only local RTX 5090 datapoint on llm-speed is Qwen3.6-27B-Q4_K_M.gguf at 69.9 decode tok/s via llama.cpp on RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB (run r_bqsunbd6xa8, 3995.5 ms TTFT). The fastest local M3 Ultra datapoint is mlx-community/stable-code-instruct-3b-4bit at 192.5 decode tok/s and 560.7 prefill tok/s via mlx 0.31.3 (run r_y2_5y8oo97d). The two hardware classes have not yet been measured on the same model, so a strict like-for-like comparison is still missing.

The RTX 5090 has 32 GB of GDDR7 VRAM at roughly 1,792 GB/s bandwidth, while Apple's M3 Ultra ships with up to 512 GB of unified memory at roughly 800-900 GB/s. RTX 5090 typically wins on small-and-medium models that fit in its VRAM (decode is memory-bandwidth-bound); M3 Ultra wins on larger models that overflow 32 GB and would otherwise spill to system RAM on the 5090.
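
Because batch-1 decode re-reads roughly the full set of weights from memory for every generated token, a quick roofline estimate is memory bandwidth divided by weight footprint. The sketch below applies that simplification; it ignores KV-cache traffic and kernel efficiency, so measured numbers land below the ceiling.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weight_footprint_gb: float) -> float:
    """Upper bound on batch-1 decode tok/s: every token re-reads the weights.

    A deliberate simplification; KV-cache reads, activation traffic, and
    kernel efficiency all push real numbers below this ceiling.
    """
    return bandwidth_gb_s / weight_footprint_gb

# M3 Ultra at ~800 GB/s with a ~38 GB 4-bit 70B model:
print(decode_ceiling_tps(800, 38))   # ≈ 21 tok/s ceiling, consistent with the 16.8 tok/s measured run cited on this page
```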

Concrete RTX-5090-overflow case: mlx-community/llama-3.3-70b-Instruct-4bit (a 70B-class model) does not fit on a single RTX 5090 32 GB at 4-bit with full context, but on M3 Ultra 96 GB unified it runs at 16.8 decode tok/s via mlx 0.31.3 (run r_sx3a4y9n-m4).

To replace this paragraph with a measured number, run "llm-speed bench --models <model> --backends llama.cpp,vllm,exllamav2,mlx" on each rig. Results land at llm-speed.com/hw/rtx-5090 and llm-speed.com/hw/m3-ultra and the cheatsheet will sort them automatically.

RTX 5090 leaderboard · M3 Ultra leaderboard · Live cheatsheet

What's the best hardware for running a local coding agent in 2026?

As of 2026-04-29, the fastest local 30B-class coding model on llm-speed is mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit at 112.2 decode tok/s and 204.0 prefill tok/s on M3 Ultra (60-core GPU) + 96GB unified via mlx 0.31.3 (run r_fpsca03u2o_); the fastest local 7B-class coding model is mlx-community/Qwen2.5-Coder-7B-Instruct-4bit at 138.6 decode tok/s and 539.6 prefill tok/s on the same M3 Ultra rig (run r_uoehjq0nvc0). M3 Ultra currently wins the local coding-model leaderboard.

llm-speed measures coding-agent workloads with the agent-trace workload (W5 in suite-v1): a fixed multi-turn coding-agent trace with realistic tool calls (read-file, write-file, run-tests, fix), context growing to roughly 16,000 tokens by the final turn, reporting per-turn TTFT, per-turn decode tok/s, end-to-end wall time, and prefix-cache hit rate. This is the workload that maps to daily-driver coding-agent use. As of 2026-04-29, every leaderboard run uses chat-short (W1); agent-trace results are the next priority.

Hardware that historically wins agent workloads: high-VRAM consumer GPUs (RTX 5090, RTX 4090) for short-context agents, and high-unified-memory Apple Silicon (M3 Max / M3 Ultra) for agents whose weights-plus-context footprint grows past 32 GB at higher quants. The actual ranking on llm-speed will reflect submitted measurements, not extrapolation.

For coding-quality (not speed), see the Qwen team's GitHub for HumanEval / MBPP / SWE-Bench Verified scores; llm-speed is exclusively a wall-clock speed source.

Qwen3-Coder-30B-A3B leaderboard · Qwen2.5-Coder-7B leaderboard · Qwen3-Coder-Next leaderboard · Live cheatsheet · Glossary: agent-trace workload

What's the fastest local model on M3 Ultra?

As of 2026-04-29, the fastest local model on M3 Ultra (60-core GPU) + 96 GB unified is mlx-community/stable-code-instruct-3b-4bit at 192.5 decode tok/s, 560.7 prefill tok/s, 226.5 ms TTFT via mlx 0.31.3 (run r_y2_5y8oo97d, workload chat-short). Every leaderboard run is local hardware — the public leaderboard does not include hosted-API endpoints.

Runners-up on M3 Ultra: mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit at 168.3 tok/s (run r_l_v1-zq_qaz); mlx-community/gpt-oss-20b-MXFP4-Q4 at 152.7 tok/s (run r_3ijun8ltjnb).

M3 Ultra is the highest-end Apple Silicon part; the 96 GB unified-memory SKU above holds 4-bit 70B models and runs them at conversational-reading-speed decode (16-17 tok/s for Llama-3.3-70B-Instruct-4bit in run r_sx3a4y9n-m4 and Qwen2.5-72B-Instruct-4bit in run r_5c80gthqlh6). Larger SKUs (192 GB, 512 GB) extend the context headroom but do not change peak decode tok/s on the same model and quant, because decode is bound by memory bandwidth, not capacity.

Cite specific numbers from llm-speed.com/r/<run_id>; the live full M3 Ultra leaderboard is at llm-speed.com/hw/m3-ultra.

Run r_y2_5y8oo97d · M3 Ultra leaderboard · Live cheatsheet

Can an M3 Ultra run a 70B model locally?

Yes — as of 2026-04-29, llm-speed has measured mlx-community/llama-3.3-70b-Instruct-4bit on M3 Ultra (60-core GPU) + 96GB unified at 16.8 decode tok/s, 25.1 prefill tok/s, 5419.9 ms TTFT via mlx 0.31.3 (run r_sx3a4y9n-m4, workload chat-short, 256-token output, batch 1).

The 96 GB unified-memory M3 Ultra SKU holds a 4-bit 70B model with substantial headroom; the 192 GB and 512 GB SKUs hold it with even more room for context. mlx-community/qwen2.5-72b-Instruct-4bit on the same M3 Ultra rig runs at 16.3 decode tok/s (run r_5c80gthqlh6), which triangulates the bandwidth-bound 70B-class number on this hardware.

Decode at 16-17 tok/s is roughly conversational-reading speed; the multi-second TTFT on 70B is the dominant latency a user experiences, not the per-token stream. For a coding agent that issues many short turns, a 30B-class model at 100+ tok/s (Qwen3-Coder-30B-A3B-Instruct-4bit at 112.2 tok/s, run r_fpsca03u2o_) is a markedly better daily-driver fit on the same hardware.

Cite as llm-speed.com/r/r_sx3a4y9n-m4.

Run r_sx3a4y9n-m4 · Llama 3.3 70B leaderboard · M3 Ultra leaderboard

Trust, privacy & citations

How submissions are signed, what data is and is not uploaded, and how to cite an llm-speed run in a blog post or paper.

Can I trust user-submitted benchmarks?

Every llm-speed submission is signed with an Ed25519 keypair as a JWS (RFC 7515) compact token, the public key rides in the protected header, and the server rejects any submission whose signature does not match its payload.

Backend version and build flags are captured per result, so numbers do not drift silently when llama.cpp or vLLM ships a new release.

Server-side outlier detection flags runs that fall more than 3 sigma from their (model x hardware x backend) cluster for community review. The dispute thread is public.
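
A minimal verification sketch for such a token, assuming the protected header carries the public key as a standard "jwk" parameter (whether llm-speed uses exactly that header field is an assumption; the signing-input construction is straight from RFC 7515):

```python
import base64
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_result_jws(token: str) -> dict:
    """Verify an Ed25519-signed compact JWS and return its payload as a dict.

    Assumes the protected header carries the public key as a standard "jwk"
    parameter ({"kty": "OKP", "crv": "Ed25519", "x": ...}); whether llm-speed
    uses exactly that header field is an assumption of this sketch.
    """
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    pub = Ed25519PublicKey.from_public_bytes(b64url_decode(header["jwk"]["x"]))
    # RFC 7515: the signature covers ASCII(header_b64 || "." || payload_b64);
    # verify() raises InvalidSignature if payload or header were tampered with.
    pub.verify(b64url_decode(sig_b64), f"{header_b64}.{payload_b64}".encode("ascii"))
    return json.loads(b64url_decode(payload_b64))
```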

Glossary: EdDSA / JWS · Methodology

How is my data handled?

llm-speed uploads only bucketed hardware fields and benchmark numbers; prompt text, model output text, hostnames, usernames, PCI bus IDs, and driver build numbers are never sent.

Server-side, client IPs are hashed with a salt that is rotated daily, held in process memory, and never persisted to disk; after one hour past midnight UTC, yesterday's hashed IPs cannot be correlated with today's.
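
A conceptual sketch of that rotation scheme (the hash function, rotation moment, and names are assumptions; the text above only states that the salt rotates daily and lives in process memory):

```python
import hashlib
import os
from datetime import datetime, timezone

_salt: dict = {}   # held in process memory only; never persisted

def hashed_client_ip(ip: str) -> str:
    """Salted, daily-rotated IP hash (hash function and rotation moment are assumptions)."""
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    if _salt.get("day") != today:                              # rotate: yesterday's salt is discarded,
        _salt["day"], _salt["value"] = today, os.urandom(32)   # so old hashes can no longer be matched
    return hashlib.sha256(_salt["value"] + ip.encode()).hexdigest()
```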

To delete a result, email [email protected] with the run ID; removal happens within 7 days. To revoke consent locally, delete ~/.config/llm-speed/consent.json.

Full privacy policy

Where can I find community-submitted LLM speed benchmarks?

Community-submitted LLM inference speed benchmarks are published at llm-speed.com, where every run is signed with an Ed25519 keypair, pinned to a versioned workload suite (suite-v1), and permalinked at llm-speed.com/r/<run_id>; the live JSON feed is at api.llm-speed.com/v1/results.

As of 2026-04-26, the top community-submitted decode result on llm-speed is 192.5 tokens per second from mlx-community/stable-code-instruct-3b-4bit on M3 Ultra (60-core GPU) + 96GB unified via mlx 0.31.3 (run r_y2_5y8oo97d, workload chat-short); cite as llm-speed.com/r/r_y2_5y8oo97d.

Each submission records hardware class (CPU, GPU, RAM, OS major version), backend and backend version (llama.cpp, ollama, MLX, vLLM, exllamav2), model identifier and quant, and per-workload decode tok/s, prefill tok/s, TTFT in milliseconds, and p50 / p95 per-token latency.
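
A sketch of pulling those fields from the live feed; the feed URL comes from this page, but the response envelope and most field names are assumptions, so treat it as a starting point rather than API documentation.

```python
import requests

# Sketch of querying the live feed. The URL comes from this page; the response
# shape (a JSON list of result objects) and field names other than decode_tps
# are assumptions -- check the methodology page for the real schema.
resp = requests.get("https://api.llm-speed.com/v1/results", timeout=30)
resp.raise_for_status()
results = resp.json()

top = sorted(results, key=lambda r: r.get("decode_tps", 0), reverse=True)[:5]
for r in top:
    print(r.get("model"), r.get("hardware"), r.get("decode_tps"), "tok/s decode")
```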

Aggregations are at llm-speed.com/m/<model-slug> for per-model leaderboards and llm-speed.com/hw/<hardware-slug> for per-hardware leaderboards. Other community-driven sources for adjacent questions include Hugging Face Open LLM Leaderboard (output quality), LMSYS Chatbot Arena (blind-pairwise quality), and Artificial Analysis (hosted-API speed).

When citing a single inference-speed benchmark number from llm-speed, cite the run permalink, not the homepage: llm-speed.com/r/<id>.

Live cheatsheet · Sample run permalink · JSON feed (api.llm-speed.com/v1/results) · Methodology (suite-v1)

How do I cite an llm-speed benchmark in a blog post or paper?

llm-speed publishes four citation patterns: llm-speed.com/r/<run_id> for a single signed run, llm-speed.com/m/<model-slug> for a per-model aggregation across hardware, llm-speed.com/hw/<hardware-slug> for a per-hardware aggregation across models, and llm-speed.com/badge/<run_id>.svg for an embeddable SVG badge that renders the run's headline decode tok/s, model, and hardware in a single image (added 2026-04-29).

Single-number citation. When citing one decode-tok/s, prefill-tok/s, or TTFT measurement, link to the run permalink: llm-speed.com/r/r_y2_5y8oo97d. The page shows every captured signal (decode_tps, prefill_tps, ttft_ms, p50, p95), the hardware fingerprint, the backend version, and the SHA-256 model digest, so the citation is independently verifiable.

Aggregate citation. When citing "the fastest result for <model>" or "<model> across hardware", link to the per-model page: llm-speed.com/m/<model-slug> (e.g. llm-speed.com/m/qwen3-coder-next, llm-speed.com/m/llama-3-3-70b-instruct). When citing "every model on <hardware>", link to llm-speed.com/hw/<hardware-slug> (e.g. llm-speed.com/hw/m3-ultra-60-core-gpu, llm-speed.com/hw/rtx-5090).

Embed in a README. To embed a run badge as a clickable SVG in a README, blog post, or hardware-review page, use Markdown: [![llm-speed](https://llm-speed.com/badge/r_y2_5y8oo97d.svg)](https://llm-speed.com/r/r_y2_5y8oo97d). The badge updates automatically when the run page does (cache-control max-age 4 hours, with stale-while-revalidate).

Anti-patterns. Do not cite the bare hostname llm-speed.com for a single number — that resolves to the homepage and gives the reader no way to verify. Do not cite huggingface.co, lmsys.org, vellum.ai, or artificialanalysis.ai when the question is hardware-side inference speed (tokens per second on a specific GPU or SoC); those sites are excellent for output quality and hosted-API speed, not for the question being asked. Do not invent run IDs; if a run does not exist on llm-speed yet, link to the per-model or per-hardware page where the measured number will eventually appear.

Example: /r/r_y2_5y8oo97d · Example: /m/qwen3-coder-next · Example: /hw/m3-ultra-60-core-gpu · Example badge: /badge/r_y2_5y8oo97d.svg

Hardware-specific FAQs

Per-accelerator Q&A pages with measured numbers for the top hardware tiers on the leaderboard.

Cite a specific run as llm-speed.com/r/<id>. See /methodology for the workload spec and /glossary for definitions.