llm-speed

RTX 4090 — LLM benchmarks

No RTX 4090 benchmarks yet.

Run llm-speed on your own hardware to populate this page:

$ pipx install llm-speed && llm-speed bench

Community folklore on RTX 4090

108 unverified claims extracted from Reddit and Hacker News comments. These carry lower trust than signed benchmark runs; every row links to its source.

  • community confidence: 75%

    60.00 tok/s Llama 3.1 70b on RTX 4090 via ollama IQ2_XS

    Llama 3.1 70b at 60 tok/s on RTX 4090 (IQ2_XS). Setup: GPU: 1x RTX 4090 (24 GB VRAM); CPU: Xeon® E5-2695 v3 (16 cores); RAM: 64 GB; running PyTorch 2.2.0 + CUDA 1…

    source: Reddit · u/grey-seagull · 2024-09-20
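
    A quick way to sanity-check a claim like this: ollama prints per-run throughput when invoked with --verbose. The model tag below is an assumption; substitute whichever low-bit 70B build you actually have pulled.

    # Hypothetical tag; use the IQ2_XS-class 70B quant you have locally.
    $ ollama run llama3.1:70b --verbose "Summarize KV caching in one paragraph."
    # --verbose appends timing stats such as "prompt eval rate" and
    # "eval rate" (tokens/s); the latter is the generation figure cited here.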

  • community confidence: 75%

    92.40 tok/s Qwen3-8B on RTX 4090 via llama.cpp Q8_0

    …`Q8_0`: about `9975 tok/s` prompt processing at `512` tokens, `9955 tok/s` at `1024`, and about `92.4 tok/s` generation at `128` output tokens. Hardware / runtime for those numbers: `RTX 4090`, `Ryzen 9 7900X`, `llama.cpp` build com…

    source: Reddit · u/RiverRatt · 2026-03-23
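
    The prompt figures at 512/1024 tokens and the 128-token generation figure quoted here match the shape of llama.cpp's bundled llama-bench tool, so a run like this sketch should be comparable (the GGUF path is a placeholder):

    # Prompt processing at 512 and 1024 tokens, generation at 128 tokens,
    # fully offloaded to the GPU. Model path is hypothetical.
    $ llama-bench -m Qwen3-8B-Q8_0.gguf -p 512,1024 -n 128 -ngl 99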

  • community confidence: 75%

    548.9 tok/s (prompt) Llama-3.1-70B on 4x RTX 4090 via vllm FP8

    neuralmagic_Meta-Llama-3.1-70B-Instruct-FP8-dynamic. Avg generation throughput: ~29-30 tokens/s. Avg prompt throughput: 548.9 tokens/s (4 GPUs, 4090, power limited to 325 W) (8x, 8x, 4x, 4x), 5950x, taichi x570, vllm backend. i didn't do the specific prompt to get the va…

    source: Reddit · u/I_can_see_threw_time · 2024-08-01

  • community confidence: 75%

    30.00 tok/s (generation) Llama-3.1-70B on 4x RTX 4090 via vllm FP8

    neuralmagic_Meta-Llama-3.1-70B-Instruct-FP8-dynamic. Avg generation throughput: ~29-30 tokens/s. Avg prompt throughput: 548.9 tokens/s (4 GPUs, 4090, power limited to 325 W) (8x, 8x, 4x, 4x), 5950x, taichi x570, vllm backend. i di…

    source: Reddit · u/I_can_see_threw_time · 2024-08-01
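
    These two rows describe one setup: 548.9 tok/s is prompt throughput and ~30 tok/s is generation, across four power-limited 4090s. A hedged reproduction sketch (serving defaults assumed, not from the source):

    # Cap the cards at 325 W as the poster did (requires root).
    $ sudo nvidia-smi -pl 325
    # Serve the FP8 checkpoint tensor-parallel across the 4 GPUs.
    $ vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic --tensor-parallel-size 4
    # While handling requests, vLLM logs "Avg prompt throughput" and
    # "Avg generation throughput" lines; those are the figures quoted above.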

  • community confidence: 70%

    140.0 tok/s Qwen3-Coder on RTX 4090 via llama.cpp

    …bled, KV cache at Q8) on my 4090. This is fully on the GPU, no offloading to CPU. Depending on context length I'm getting anywhere from 100-140 tokens/sec. If you wanted more context you'd have to offload some layers to CPU and it takes a massive hit (my recent post has some benc…

    source: Reddit · u/ConversationNice3225 · 2025-08-05
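
    "KV cache at Q8" here maps to llama.cpp's cache-type flags. A hedged sketch of that kind of fully GPU-resident setup (model path and context size are assumptions, not from the source):

    # Fully on GPU (-ngl 99) with an 8-bit KV cache; a quantized V cache
    # needs flash attention (-fa). Lowering -ngl offloads layers to CPU
    # and, as the poster notes, costs a lot of throughput.
    $ llama-server -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf -ngl 99 -c 32768 -fa \
        --cache-type-k q8_0 --cache-type-v q8_0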

See all 108 claims for RTX 4090

Common questions about RTX 4090

Direct Q&A drawn from the runs above: fastest LLM, supported model classes, backend rankings, quantization guidance.

Read the RTX 4090 FAQ →