V3
No benchmarks for V3 yet.
Run on your hardware to populate this page:
$ pipx install llm-speed && llm-speed bench
Community folklore
30 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- community confidence 75%
5.00 tok/s — DeepSeek-V3 on Together AI via hosted-api FP4
“VRAM, higher vRAM might helpful for long-context, PCIe5 might helpful for prefill. DDR4 Epyc or Icelake-SP Xeons with GPU may only reach ~5t/s (from my deepseek-V3 experience), which is pretty slow. - Serious one: 4x RTX Pro 6000 Blackwell together with modern Xeon or Epyc shoul…”
- community confidence 65%
10.00 tok/s — DeepSeek V3 on RTX 4090 FP8
“g fast. Expert weights on CPU (4-bit GGUF): All the huge MoE banks live in system RAM and load as needed. End result: I’m seeing about ~10 tokens/sec with a 32K context window—pretty smooth for local tinkering. KTransformers made it so easy with its Docker image. It handles …”
- community confidence 65%
10.00 tok/s — DeepSeek V3 on RTX 4090 FP8
“Finally got ~10t/s DeepSeek V3-0324 hybrid (FP8+Q4_K_M) running locally on my RTX 4090 + Xeon with 512GB RAM, KTransformers and 32K context Hey everyone”
- community confidence 65%
56.00 tok/s — DeepSeek V3.2 via vllm FP16
“: With the vllm-gfx906-mobydick fork, you can also run smaller recent models (as the base is vllm v0.17.1) like **Qwen3.5 27B** (reaching **56 tok/s** at MTP5 and TP4 but it fits also on 1 MI50 32GB with 65k context; maybe later, if you are interested in, I can also make another …”
- community confidence 60%
25.00 tok/s — DeepSeek V3 on H100 via vllm
“other developers, unlimited tokens Running DeepSeek V3 (685B) requires 8×H100 GPUs which is about $14k/month. Most developers only need 15-25 tok/s. sllm lets you join a cohort of developers sharing a dedicated node. You reserve a spot with your card, and nobody is charged until…”
- community confidence 60%
10.85 tok/s (token generation) — DeepSeek-V3.1 on RTX 5090 via llama.cpp
“gh the system popping out through a small gap left by RTX 5090). DeepSeek-V3.1-Terminus with context = 37279 tokens: PP = 151.76 tps, TG = 10.85 tps Some things I discovered running local LLMs: * For water-cooled CPU systems, there is not enough air circulation to cool the RAM…”
- community confidence 60%
151.8 tok/s (prompt processing, not generation) — DeepSeek-V3.1 on RTX 5090 via llama.cpp
“is threaded through the system popping out through a small gap left by RTX 5090). DeepSeek-V3.1-Terminus with context = 37279 tokens: PP = 151.76 tps, TG = 10.85 tps Some things I discovered running local LLMs: * For water-cooled CPU systems, there is not enough air circulatio…”
- community confidence 60%
10.00 tok/s — DeepSeek V3 on RTX 4090 via llama.cpp (CPU-offloaded MoE; see the config sketch after this list)
“DeepSeek V3-0324 Q4_K_M with 512GB DDR5 RAM 4800MHz and one RTX 4090; Prompt eval speed: 40 t/s; generation: ~10 t/s; DDR5 with 8 channels certainly improves the speed but it is not blazing fast. 3. If you have more VRAM, you can try --ubatch-size …”