
State of the local LLM — May 2026

The canonical answer to “what’s the fastest local LLM right now” as of May 2026, measured under llm-speed suite-v1. Numbers are wall-clock decode tok/s on the highest-decode workload that successfully ran. Every cell links to the run that produced it.
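
Decode tok/s here is simply tokens decoded divided by wall-clock decode seconds. A minimal sketch with made-up counters (not taken from any run on this page):

# hypothetical counters: 8192 tokens decoded over 128.0 s of wall-clock decode time
echo "scale=1; 8192 / 128.0" | bc   # prints 64.0 (tok/s)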

Headline cells

Fastest local model: 80.3 tok/s
Qwen3-Next-80B-A3B-Instruct-MLX-4bit on M3 Ultra (60-core GPU) + 96GB unified (mlx) · View run →

Fastest local 70B+ class: 80.3 tok/s
Qwen3-Next-80B-A3B-Instruct-MLX-4bit on M3 Ultra (60-core GPU) + 96GB unified (mlx) · View run →

Fastest local coding agent: no coder-family submissions this month

Top 10 decode tok/s — May 2026

Editor’s notes

Inaugural issue. The numbers on this page are measured under suite-v1, with the dual-domain trust chain in place (CDN sha256 + GitHub Releases mirror). Both the fastest-model and 70B+-class headlines come from MLX on an M3 Ultra; no coder-family runs were submitted, so the coding-agent headline is empty this month. Future issues will add a delta column showing how each headline moved month over month.

26 runs landed in May 2026. To reproduce any number on this page, install the CLI and run the suite on the same model + hardware:

pipx install https://llm-speed.com/dist/llm_speed-0.0.1-py3-none-any.whl
llm-speed verify
llm-speed bench
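
The dual-domain trust chain mentioned in the notes can also be checked by hand before installing. A rough sketch; the checksum file names and release tag below are assumptions rather than documented paths, and llm-speed verify remains the supported route:

curl -fsSLO https://llm-speed.com/dist/llm_speed-0.0.1-py3-none-any.whl
# assumed CDN checksum location
curl -fsSL https://llm-speed.com/dist/llm_speed-0.0.1-py3-none-any.whl.sha256
# assumed GitHub Releases mirror location
curl -fsSL https://github.com/meadow-kun/llm-speed/releases/download/v0.0.1/llm_speed-0.0.1-py3-none-any.whl.sha256
# the local digest must match both values above
shasum -a 256 llm_speed-0.0.1-py3-none-any.whl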

Methodology: /methodology · Privacy: /privacy · Source: github.com/meadow-kun/llm-speed