Artificial Intelligence 17/06/2026 5 min read

AMD Strix Halo (Ryzen AI Max+ 395) for Local AI: the €1,500 DGX Spark killer?

128 GB of unified memory, an RDNA 3.5 iGPU and a price three times lower than the DGX Spark. We break down the real LLM benchmarks of the Ryzen AI Max+ 395: actual tokens/s in dense and MoE, the 256 GB/s wall, and who this APU is really worth it for.

When NVIDIA shipped its €4,000 DGX Spark, AMD was preparing its answer in the same arena: running large models locally thanks to massive unified memory. Except the Ryzen AI Max+ 395 — codenamed Strix Halo — arrives with a brutal argument: ~€1,500, nearly three times cheaper. On paper, it's the Spark killer.

As always, the spec sheet lies by omission. This article breaks down how this APU actually performs at LLM inference, using community benchmarks, to tell you whether it deserves your money or not.

What's inside the chip?

The Ryzen AI Max+ 395 is an SoC built on TSMC 4nm that combines CPU, GPU and NPU in a single package, with a shared memory pool. Concretely:

CPU: 16 Zen 5 cores
Integrated GPU: 40 RDNA 3.5 compute units (GPU target gfx1151 in ROCm/LLVM)
NPU: XDNA 2, ~50 TOPS INT8
Memory: 128 GB LPDDR5X-8000 unified, shared between CPU and GPU
Memory bandwidth: ~256 GB/s
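
The ~256 GB/s figure follows directly from the memory configuration. A minimal sketch of the arithmetic, assuming a 256-bit memory bus (the bus width is not given in the list above, and the function name is purely illustrative):

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9

# Assumption: 256-bit bus. The 8000 MT/s rate comes from "LPDDR5X-8000".
strix_halo = peak_bandwidth_gbs(bus_width_bits=256, transfer_rate_mts=8000)
print(f"{strix_halo:.0f} GB/s")  # 256 GB/s
```

This is a theoretical peak; sustained bandwidth in real workloads sits below it.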

The idea is exactly the same as NVIDIA's: unified memory eliminates the costly copy between system RAM and VRAM, and lets you load models that would blow up any consumer card. An RTX 5090 tops out at 32 GB; here, you have room for a quantized dense 70-billion-parameter model without any acrobatics.

One crucial detail for anyone chasing performance: GPU allocation depends on the OS. On Linux you can assign up to 110 GB directly to the GPU; on Windows, AMD's Variable Graphics Memory caps at 96 GB. For serious local AI, it's Linux.
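
To see what those limits mean in practice, here is a rough sketch that estimates the size of the weights alone and compares it with the memory budgets quoted in this article. It assumes nominal bit widths (real quantization formats carry some overhead) and ignores the KV cache and runtime buffers, so treat the results as lower bounds. The function names are illustrative.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

# Memory budgets in GB, taken from the figures quoted in this article.
BUDGETS = {
    "RTX 5090": 32,
    "Strix Halo, Windows": 96,
    "Strix Halo, Linux": 110,
}

def fits(params_billions: float, bits_per_weight: float) -> dict:
    """Map each budget name to True if the weights fit within it."""
    size = weights_gb(params_billions, bits_per_weight)
    return {name: size <= budget for name, budget in BUDGETS.items()}

for bits in (4, 8):
    print(f"70B at {bits} bits: {weights_gb(70, bits):.0f} GB -> {fits(70, bits)}")
```

At a nominal 4 bits a 70B model needs about 35 GB for its weights, which already overflows a 32 GB card but leaves plenty of headroom under either Strix Halo limit.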

The real numbers: tokens/s in practice

Community benchmarks (under llama.cpp with a Vulkan or ROCm backend) paint a nuanced picture, and that's where everything plays out.

Small models (≤ 3B): excellent

A 3B in Q4 climbs from ~28 to ~93 tokens/s in decode depending on context. On bandwidth-bound hardware, quantization acts as a near-direct multiplier: fewer bits per weight means fewer bytes to read per token, and therefore more throughput. For light coding assistance or RAG, it's smooth.

Mid-size models (10–30B): the variable zone

This is the "useful but uneven" segment: some 20–30B models stay pleasant (9 to 20 tok/s), others collapse to 4 tok/s. Architecture matters more than parameter count here.

Large dense models (70B): it loads, but it crawls

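A 70B loads in full without sharding, which is impossible on a single consumer card, but only in quantized form: at BF16 the weights alone would need ~140 GB, more than the 128 GB on board. Inference then tops out at ~14 tok/s in community reports. That's not a model flaw: it's the inherent cost of dense generation, throttled by bandwidth.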

MoE models: the real playground

This is where the big memory makes sense. Qwen3 235B, a Mixture-of-Experts that activates only ~22 billion of its 235 billion parameters per pass, runs at ~11 tok/s with a first token in 0.03 s. Each generated token only reads the weights of the active experts, not the whole model, even though the whole model still has to sit in memory. On this class of hardware, favoring MoE architectures isn't a detail, it's the strategy.
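
The gap between what an MoE keeps in memory and what it reads per token can be sketched as follows. The parameter counts come from the paragraph above; the 3-bit width is an assumption, picked only because it keeps the full model under the ~110 GB Linux allocation, and the helper name is illustrative.

```python
def gb(params_billions: float, bits_per_weight: float) -> float:
    """Size in GB (1 GB = 1e9 bytes) of a given number of weights."""
    return params_billions * bits_per_weight / 8

TOTAL_B = 235   # total parameters, in billions (from the article)
ACTIVE_B = 22   # parameters active per token, in billions (from the article)
BITS = 3        # assumption: nominal bits per weight after quantization

resident_gb = gb(TOTAL_B, BITS)     # must stay loaded in unified memory
per_token_gb = gb(ACTIVE_B, BITS)   # read from memory for each generated token

print(f"resident: {resident_gb:.1f} GB")         # resident: 88.1 GB
print(f"read per token: {per_token_gb:.2f} GB")  # read per token: 8.25 GB
print(f"active share: {ACTIVE_B / TOTAL_B:.1%}") # active share: 9.4%
```

Capacity is sized by the total parameter count, while per-token memory traffic is sized by the active one, which is why a large pool of modest-bandwidth memory suits this architecture.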

The bandwidth wall (again)

Here's the number that explains everything: ~256 GB/s. That's almost identical to the DGX Spark (273 GB/s), and it's 4 to 7 times less than a discrete GPU (RTX 5090: ~1,790 GB/s).

Why does it matter? Because token generation is bound by memory bandwidth, not compute. To produce each token of a dense model, the GPU must re-read every weight. For comparison, an RTX 4090 does 127 tok/s on an 8B versus ~48 tok/s for the Strix Halo. The discrete card crushes the APU on raw speed — but it simply cannot load the models the APU swallows without flinching. That's the whole trade-off.
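
The reasoning above gives a simple upper bound: decode speed cannot exceed memory bandwidth divided by the bytes read per token. A rough sketch, with two assumptions that are not in the article: the 8B benchmark is taken to use a nominal 4-bit quantization, and the RTX 4090 is taken at its published ~1,008 GB/s. KV cache reads are ignored, so real ceilings are somewhat lower.

```python
def decode_ceiling_tok_s(bandwidth_gbs: float, params_billions: float,
                         bits_per_weight: float) -> float:
    """Upper bound on dense decode speed: every weight is read once per token."""
    model_gb = params_billions * bits_per_weight / 8
    return bandwidth_gbs / model_gb

# name: (bandwidth in GB/s, measured tok/s on an 8B quoted in the article)
CARDS = {
    "Strix Halo": (256, 48),
    "RTX 4090": (1008, 127),   # assumption: bandwidth from the public spec sheet
}

for name, (bandwidth, measured) in CARDS.items():
    ceiling = decode_ceiling_tok_s(bandwidth, params_billions=8, bits_per_weight=4)
    print(f"{name}: ceiling {ceiling:.0f} tok/s, measured {measured} "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions the Strix Halo lands at about 75% of its ceiling (64 tok/s) and the RTX 4090 at about 50% of its own (252 tok/s): both measured figures stay below the bandwidth bound, as expected.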

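Good news on the prefill side (prompt reading), though: prompt tokens are processed in batches, so each weight read from memory is reused across many tokens and the phase is limited by compute rather than bandwidth. Some models reach ~1,190 tok/s on read. For a RAG pipeline ingesting a 50-page document, the read phase is over quickly.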
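
To put a number on the 50-page example, here is a back-of-the-envelope estimate. The tokens-per-page value is an assumption (it varies with layout, language and tokenizer), and the prefill rate is the best-case figure quoted above, which will be lower for larger models or longer contexts.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to read the prompt, assuming a constant prefill rate."""
    return prompt_tokens / prefill_tok_s

PAGES = 50
TOKENS_PER_PAGE = 500   # assumption: rough average for dense text
PREFILL_TOK_S = 1190    # best-case prefill rate quoted in the article

prompt_tokens = PAGES * TOKENS_PER_PAGE
seconds = prefill_seconds(prompt_tokens, PREFILL_TOK_S)
print(f"{prompt_tokens} tokens read in about {seconds:.0f} s")
# 25000 tokens read in about 21 s
```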

Strix Halo vs DGX Spark: the matchup

Both play in the same category — big unified memory, modest bandwidth, local AI — but with opposite philosophies:

Architecture: Strix Halo is x86 (Zen 5 + RDNA 3.5), the DGX Spark is ARM (Grace + Blackwell).
Ecosystem: NVIDIA has CUDA, mature and universal. AMD relies on ROCm / Vulkan, clearly improving but still rougher to set up.
Hardware FP4: the Spark's Blackwell handles it natively, RDNA 3.5 does not.
Price: ~€1,500 versus ~€4,000–4,700. That's AMD's number-one argument.

Head-to-head verdict: if you want the best capacity/price ratio to tinker with large MoE models locally and you're not afraid of Linux + ROCm, the Strix Halo is unbeatable. If you depend on CUDA or FP4 for your workflow, the Spark's premium is justified.

Where to buy it?

The Ryzen AI Max+ 395 ships in several turnkey machines: the Framework Desktop (the most open and documented for tinkering), mini-PCs like the GMKtec EVO-X2, and various high-end ultraportables. For local AI, aim strictly for the 128 GB configuration — the 32/64 GB variants kill the whole point of unified memory.

Verdict: who is it really for?

Buy it if you want to run large models (ideally MoE) locally without breaking the bank, data privacy matters to you, you mostly work in batch or RAG, and Linux doesn't scare you.

Skip it if you're after maximum generation speed for interactive use (a discrete GPU will be far faster), if your models already fit in 24–32 GB of VRAM, or if you can't do without CUDA.

The Strix Halo is neither the "NVIDIA killer" miracle of clickbait headlines nor a gimmick. It's an honest APU that democratizes local access to large models at the cost of modest throughput, and does it for nearly three times less than the competition. In the fast-growing local-AI ecosystem, it's a serious contender.

Further reading: our analysis of NVIDIA's DGX Spark, and our guide to calculating the VRAM needed for a local LLM.


Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.
