V3
No benchmarks for V3 yet.
Run on your hardware to populate this page:
$ pipx install llm-speed && llm-speed bench
Community folklore
30 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- community confidence 75%
5.00 tok/s — DeepSeek-V3 on Together AI via hosted-api FP4
“VRAM, higher vRAM might helpful for long-context, PCIe5 might helpful for prefill. DDR4 Epyc or Icelake-SP Xeons with GPU may only reach ~5t/s (from my deepseek-V3 experience), which is pretty slow. - Serious one: 4x RTX Pro 6000 Blackwell together with modern Xeon or Epyc shoul…”
- community confidence 65%
10.00 tok/s — DeepSeek V3 on RTX 4090 FP8
“g fast. Expert weights on CPU (4-bit GGUF): All the huge MoE banks live in system RAM and load as needed. End result: I’m seeing about ~10 tokens/sec with a 32K context window—pretty smooth for local tinkering. KTransformers made it so easy with its Docker image. It handles …”
- community confidence 65%
10.00 tok/s — DeepSeek V3 on RTX 4090 FP8
“Finally got ~10t/s DeepSeek V3-0324 hybrid (FP8+Q4_K_M) running locally on my RTX 4090 + Xeon with 512GB RAM, KTransformers and 32K context Hey everyone”
- community confidence 65%
56.00 tok/s — DeepSeek V3.2 via vllm FP16
“: With the vllm-gfx906-mobydick fork, you can also run smaller recent models (as the base is vllm v0.17.1) like **Qwen3.5 27B** (reaching **56 tok/s** at MTP5 and TP4 but it fits also on 1 MI50 32GB with 65k context; maybe later, if you are interested in, I can also make another …”
- community confidence 60%
25.00 tok/s — DeepSeek V3 on H100 via vllm
“other developers, unlimited tokens Running DeepSeek V3 (685B) requires 8×H100 GPUs which is about $14k/month. Most developers only need 15-25 tok/s. sllm lets you join a cohort of developers sharing a dedicated node. You reserve a spot with your card, and nobody is charged until…”
- community confidence 60%
10.85 tok/s (token generation) — DeepSeek-V3.1 on RTX 5090 via llama.cpp
“gh the system popping out through a small gap left by RTX 5090). DeepSeek-V3.1-Terminus with context = 37279 tokens: PP = 151.76 tps, TG = 10.85 tps Some things I discovered running local LLMs: * For water-cooled CPU systems, there is not enough air circulation to cool the RAM…”
- community confidence 60%
151.8 tok/s (prompt processing, not generation) — DeepSeek-V3.1 on RTX 5090 via llama.cpp
“is threaded through the system popping out through a small gap left by RTX 5090). DeepSeek-V3.1-Terminus with context = 37279 tokens: PP = 151.76 tps, TG = 10.85 tps Some things I discovered running local LLMs: * For water-cooled CPU systems, there is not enough air circulatio…”
- community confidence 60%
10.00 tok/s — DeepSeek V3 on RTX 4090 via llama.cpp (CPU-offloaded MoE; see the config sketch after this list)
“DeepSeek V3-0324 Q4_K_M with 512GB DDR5 RAM 4800MHz and one RTX 4090; Prompt eval speed: 40 t/s; generation: ~10 t/s; DDR5 with 8 channels certainly improves the speed but it is not blazing fast. 3. If you have more VRAM, you can try --ubatch-size …”