# Qwen 3.6: 35B vs 27B comparison - benchmark results
I finally compiled all the Qwen 3.6 model test results I gathered over the past few days. I compared two models in detail: the Qwen3.6-35B-A3B (MoE, hybrid attention/delta) and the Qwen3.6-27B (dense, hybrid attention/delta). I ran both as llama.cpp servers on an RTX 4090 with turbo3 KV cache compression.
If I had to summarize briefly: the 35B-A3B is 3-4x faster at generation (and over 2x at prefill), but the 27B delivers better quality. This is the classic MoE vs. dense tradeoff, just backed by numbers.
## Architecture - what’s the difference?
Both models come from the same Qwen3.6 family, both use a hybrid Mamba/attention architecture, but their approach is completely different:
35B-A3B:
- 35B total parameters, but only 3B active per token
- 40 layers: 10 blocks, each (3× Gated DeltaNet → MoE) followed by (1× Gated Attention → MoE)
- Only 10 full attention layers (GQA, 16Q/2KV, 256-dimensional)
- 30 Gated DeltaNet layers (recurrent, no KV cache)
- MoE routing: 8+1 shared expert out of 256 experts
- Native context: 262,144 tokens
27B Dense:
- 27B total parameters, 27B active per token (every parameter is activated)
- 64 layers: 16 blocks, each (3× Gated DeltaNet → FFN) followed by (1× Gated Attention → FFN) - the same pattern (see the sketch after this list)
- Only 16 full attention layers (GQA, 24Q/4KV, 256-dimensional)
- 48 Gated DeltaNet layers (recurrent, no KV cache)
- Dense FFN - no MoE, everything computed every token
- Native context: 262,144 tokens
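To make the layer pattern concrete, here's a quick sanity check (my own illustrative sketch, not the models' actual code) that reconstructs the schedule implied by the two lists above:

```python
# Illustrative: rebuild the layer schedule implied by the specs above.
def layer_schedule(blocks: int) -> list[str]:
    """Each block: 3x Gated DeltaNet followed by 1x Gated Attention."""
    return ["deltanet", "deltanet", "deltanet", "attention"] * blocks

moe_35b = layer_schedule(10)    # 35B-A3B: 10 blocks -> 40 layers
dense_27b = layer_schedule(16)  # 27B:     16 blocks -> 64 layers

assert len(moe_35b) == 40 and moe_35b.count("attention") == 10
assert len(dense_27b) == 64 and dense_27b.count("attention") == 16
```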
The key point: per token, the 35B-A3B activates only 8 of its 256 routed experts plus 1 shared expert (~3B parameters), while the 27B has to run all 27B parameters. That's roughly 9× less computation per token (3B active vs. 27B).
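For illustration, the routing step might look roughly like this generic top-k router sketch (an assumption on my part - Qwen's actual router differs in details like scoring and load balancing):

```python
import numpy as np

N_EXPERTS, TOP_K = 256, 8  # 8 routed experts, plus 1 shared that always runs

def route(router_logits: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pick the top-8 routed experts for one token and normalize their weights."""
    top = np.argsort(router_logits)[-TOP_K:]
    weights = np.exp(router_logits[top] - router_logits[top].max())
    weights /= weights.sum()  # softmax over just the selected experts
    return top, weights

rng = np.random.default_rng(0)
experts, weights = route(rng.normal(size=N_EXPERTS))
print(experts, weights)  # only these 8 expert FFNs (plus the shared one) run
```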
## Needle-In-Haystack (NIAH) - long context retrieval
This test measures whether the model can find a key piece of information in a huge text. The needle is a sentence I hid somewhere in the haystack, and the model has to retrieve it.
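My harness isn't included here, but a minimal probe against the llama.cpp server's OpenAI-compatible endpoint looks something like this (the needle text, filler, and port are made-up placeholders):

```python
import requests

NEEDLE = "The magic number for the audit is 7319."
FILLER = "The quick brown fox jumps over the lazy dog. " * 2000

def niah_probe(depth: float, url="http://localhost:8080/v1/chat/completions"):
    """Insert the needle at a given depth (0.0-1.0) and ask the model for it."""
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    r = requests.post(url, json={
        "messages": [{"role": "user",
                      "content": haystack + "\n\nWhat is the magic number for the audit?"}],
        "temperature": 0.0,
    })
    answer = r.json()["choices"][0]["message"]["content"]
    return 1.0 if "7319" in answer else 0.0  # 1.0 = retrieved, as in the tables

print(niah_probe(depth=0.5))  # needle at the 50% position
```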
35B-A3B (IQ4_XS quant, turbo3 KV cache):
| Context | 0% | 5% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|---|
| 4k | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 8k | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 16k | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 32k | 1.0 | - | 1.0 | 1.0 | - | 1.0 |
| 64k | 1.0 | - | 1.0 | 1.0 | - | 1.0 |
| 128k | 1.0 | - | 1.0 | 1.0 | 1.0 | 1.0 |
| 200k | 1.0 | - | - | 1.0 | - | 1.0 |
Total: 100% (74/74 tests) - perfect at every context length and depth position.
27B (UD-Q5_K_XL quant, turbo3 KV cache):
| Context | 0% | 5% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|---|
| 4k | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 8k | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 16k | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 32k | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 64k | 1.0 | - | 1.0 | 1.0 | 1.0 | 1.0 |
| 100k | 1.0 | - | 1.0 | 1.0 | 1.0 | 1.0 |
| 130k | 1.0 | - | 1.0 | 1.0 | 1.0 | 1.0 |
Total: 100% (78/78 tests) - also perfect everywhere.
Verdict: Both models produce perfect NIAH results, with no degradation observed under turbo3 KV cache quantization. I tested the 35B-A3B to higher context lengths (200k vs 130k), but the 27B also gave a stable 100% at all tested points.
## Token generation speed - the big difference
This is where the MoE vs. dense difference becomes truly impressive. On an RTX 4090 with turbo3 KV cache:
Decode (token generation):
| Context | 35B-A3B (tok/s) | 27B (tok/s) | Ratio |
|---|---|---|---|
| Short ctx (peak) | 161.8 | 40.3 | 4.0× |
| 4k | 152.6 | 38.7 | 3.9× |
| 8k | 142.7 | 36.5 | 3.9× |
| 16k | 122.2 | 32.4 | 3.8× |
| 32k | 96.0 | 28.1 | 3.4× |
| 64k | 65.4 | 18.6 | 3.5× |
| 100k+ | ~55 (estimated) | 16.6 | ~3.3× |
Prefill (prompt processing):
| Context | 35B-A3B (tok/s) | 27B (tok/s) | Ratio |
|---|---|---|---|
| Peak | 5912 (at 4k) | 2620 (at 4k) | 2.3× |
| 4k | 5912 | 2620 | 2.3× |
| 16k | 5441 | 2610 | 2.1× |
| 32k | 5271 | 2331 | 2.3× |
| 64k | 4688 | 1959 | 2.4× |
| 100k | ~4200 (estimated) | 1733 | 2.4× |
What does this mean in practice?
The 35B-A3B is 3.5-4x faster at token generation. This means if the 27B takes 10 seconds to write a paragraph, the 35B-A3B does the same in 2.5 seconds. The picture is similar for prefill: 2-2.4x faster prompt processing.
The interesting part: the 35B-A3B has more total parameters (35B vs 27B), yet it's still much faster, because MoE routing means only ~3B parameters are read and computed per generated token. The 27B has to stream all 27B weights through for every token.
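To translate the two tables into wall-clock time, here's a back-of-the-envelope model (a simplification that ignores batching and sampling overhead):

```python
# Rough end-to-end latency: prefill the prompt, then decode the answer.
def response_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Example: 32k-token prompt, 500-token answer, using the 32k rows above.
t_35b = response_time(32_000, 500, prefill_tps=5271, decode_tps=96.0)
t_27b = response_time(32_000, 500, prefill_tps=2331, decode_tps=28.1)
print(f"35B-A3B: {t_35b:.1f}s, 27B: {t_27b:.1f}s")  # ~11.3s vs ~31.5s
```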
## VRAM and context capacity
This is where the 35B-A3B really shines:
| | 35B-A3B (IQ4_XS) | 35B-A3B (Q4_K_S) | 27B (UD-Q5_K_XL) |
|---|---|---|---|
| Model size | 17.7 GB | 20.9 GB | 18.65 GB |
| VRAM idle | ~20400 MB | ~22700 MB | ~22900 MB |
| Max context | 262k | 188k | 156k |
| VRAM free at max | ~3600 MB | ~1300 MB | ~1134 MB |
Important note: the 35B-A3B’s 262k context is achievable with IQ4_XS quantization. With the same Q4_K_S quantization (which is closer to the 27B’s UD-Q5_K_XL), the max is 188k. This is still more than the 27B’s 156k, but not as dramatic as 262k.
The key point: in both models only the handful of full attention layers need a KV cache at all (the DeltaNet layers are recurrent), and the 35B-A3B's per-token KV footprint (10 layers × 2 KV heads) is less than a third of the 27B's (16 layers × 4 KV heads). That's what leaves more room for context.
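The per-token KV footprint can be estimated straight from the architecture specs (fp16 baseline shown for clarity; turbo3's compressed footprint is smaller, so treat these as upper bounds):

```python
# Only the full-attention layers need a KV cache; DeltaNet layers are recurrent.
def kv_bytes_per_token(attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    return attn_layers * kv_heads * head_dim * 2 * bytes_per_elem  # K and V

kv_35b = kv_bytes_per_token(attn_layers=10, kv_heads=2, head_dim=256)  # 20 KiB
kv_27b = kv_bytes_per_token(attn_layers=16, kv_heads=4, head_dim=256)  # 64 KiB

print(kv_35b * 262_144 / 2**30)  # ~5.0 GiB at 262k context (fp16)
print(kv_27b * 156_000 / 2**30)  # ~9.5 GiB at 156k context (fp16)
```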
These results aren't carried by turbo3 KV cache compression alone: the Sparse V technique (attention-gated value dequantization) also plays a big part. It's an optimization that filters out positions in the flash-attention kernel whose attention weight is negligible (below 10⁻⁶) and skips the V (value) dequantization for them entirely. The key insight: instead of trying to speed up every position's dequantization (which runs into hardware limits), it simply skips the unnecessary work. At long context this can mean up to a 22.8% decode speed improvement, without any quality loss. And the best part: the technique is not turboquant-specific - it works with any quantized KV cache format (q8_0, q4_0, turbo3), because it's based on the attention distribution, not the dequant mechanism.
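To make the mechanism concrete, here's a minimal NumPy sketch of the idea (illustrative only - the real optimization lives inside the flash-attention kernel, and `dequantize_row` is a hypothetical stand-in for the format-specific dequant routine):

```python
import numpy as np

THRESHOLD = 1e-6  # attention weights below this contribute nothing measurable

def attend_sparse_v(probs, v_quant, dequantize_row, head_dim=256):
    """probs: (seq_len,) softmaxed attention weights for one query position.
    dequantize_row(v_quant, i) -> float32 row i of V (hypothetical helper)."""
    out = np.zeros(head_dim, dtype=np.float32)
    for i, p in enumerate(probs):
        if p < THRESHOLD:
            continue  # skip the dequant and the multiply-add for this row
        out += p * dequantize_row(v_quant, i)
    return out
```

Because the skip decision depends only on the attention weights, not on how V is stored, the same trick applies to any quantized cache format.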
## Quality - where the 27B wins
Based on Qwen’s official benchmark results, the 27B outperforms the 35B-A3B on every single benchmark:
| Benchmark | 35B-A3B | 27B | Winner |
|---|---|---|---|
| SWE-bench Verified | 73.4 | 77.2 | 27B |
| SWE-bench Pro | 49.5 | 53.5 | 27B |
| Terminal-Bench 2.0 | 51.5 | 59.3 | 27B |
| SkillsBench | 28.7 | 48.2 | 27B (+19.5!) |
| MMLU-Pro | 85.2 | 86.2 | 27B |
| GPQA Diamond | 86.0 | 87.8 | 27B |
| AIME 2026 | 92.7 | 94.1 | 27B |
| LiveCodeBench v6 | 80.4 | 83.9 | 27B |
| HLE | 21.4 | 24.0 | 27B |
| QwenWebBench | 1397 | 1487 | 27B |
The special standout is SkillsBench: +19.5 points in favor of the 27B. This benchmark specializes in coding-agent tasks, and there the 27B is unbeatable. The gap is similarly clear on Terminal-Bench (+7.8) and SWE-bench (+3.8 to +4.0).
## Summary - which one should I choose?
For the 35B-A3B:
- 🚀 3-4x faster generation
- 📏 More context (188k-262k vs 156k)
- Great for RAG, long conversations, high throughput
- Quality: very good, close to the 27B on most benchmarks (SkillsBench is the big exception)
For the 27B Dense:
- 🏆 Better quality on every benchmark
- Especially better for coding agent tasks (SkillsBench: +19.5)
- Slower, but if quality is the goal, the wait is worth it
- Less context (156k max)
My practical advice:
If you’re running RAG, managing long context conversations, or simply want answers to come fast - 35B-A3B. If you’re running a coding agent, solving complex logical tasks, or quality is the most important thing - then 27B.
Both models work wonderfully with turbo3 KV cache compression, and both achieved 100% on NIAH. Turbo3 didn't degrade anything in my tests - in fact, thanks to the VRAM savings, I never had to compromise on context length.