NVIDIA is selling a tempting promise: an AI supercomputer in a 150 mm box sitting on your desk. One petaFLOP of compute, 128 GB of unified memory, models of up to 200 billion parameters running locally. All for the price of a good gaming PC. On paper, the DGX Spark — and its consumer cousin the RTX Spark, unveiled at Computex 2026 — looks like the ultimate democratization of local AI.
Except the marketing never tells you where it falls down. And it falls down at a very specific spot. This article breaks down what the GB10 Grace Blackwell chip actually does, with numbers and benchmarks, so you don't drop 4,700 EUR based on a spec sheet.
GB10 Grace Blackwell: what's in the box?
The DGX Spark is built around the NVIDIA GB10 Grace Blackwell Superchip, an SoC that fuses an ARM CPU and a Blackwell GPU onto the same substrate, linked by NVLink-C2C (5x the bandwidth of PCIe Gen 5). In concrete terms:
- CPU: 20 ARMv9 cores — 10 Cortex-X925 (performance) + 10 Cortex-A725 (efficiency)
- GPU: Blackwell architecture, 6,144 CUDA cores, 5th-generation Tensor Cores with native FP4 support
- AI compute: up to 1,000 TOPS / 1 petaFLOP in FP4 (with sparsity)
- Memory: 128 GB LPDDR5x unified (coherently shared between CPU and GPU)
- Storage: up to 4 TB NVMe
- Form factor: 150 x 150 mm, 1.2 kg, 240 W power supply
- OS: DGX OS, a lightly customized Ubuntu 24.04, with the full CUDA / NVIDIA AI stack preinstalled
The big idea is unified memory: no costly copying between system RAM and VRAM. The GPU addresses all 128 GB directly. That's what makes it possible to load models that would blow up any consumer card — an RTX 5090 tops out at 32 GB.
DGX Spark vs RTX Spark: one chip, two worlds
At Computex 2026, NVIDIA broadened the lineup with the RTX Spark, its consumer variant. The confusion is understandable, so let's be clear: the silicon is the same (GB10, sometimes called N1X on the consumer side). What changes is the use case and the ecosystem:
- DGX Spark — aimed at AI developers and researchers. Runs DGX OS (Ubuntu), ships with the data science / CUDA stack. It's a prototyping workstation.
- RTX Spark — aimed at consumers and high-end workstations. Runs Windows, and also ships as notebooks. NVIDIA is targeting the premium PC market with an ARM + RTX GPU chip.
In other words: same engine, different bodywork. If you do serious AI dev work, it's the DGX Spark on Linux. If you want a versatile ARM/RTX machine on Windows, it's the RTX Spark.
The promise: 200 billion parameters on the desktop
NVIDIA's knockout argument is capacity. With 128 GB unified, the Spark loads models of up to 200 billion parameters locally. And if that's not enough, you can link two units directly through their ConnectX-7 network cards (200 Gbps cable, no switch) to create a 256 GB pool and target models of 405 billion parameters.
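A quick sanity check of those capacity figures, as a minimal sketch. It assumes weights stored at 4 bits per parameter (the precision is an assumption of this example, not a figure quoted by NVIDIA here) and ignores the KV cache, activations and OS overhead. The helper name `weights_gb` is hypothetical.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


# (label, parameters in billions, memory pool in GB) taken from the text above
cases = [("one Spark", 200, 128), ("two Sparks", 405, 256)]

for label, params_b, pool_gb in cases:
    for bits in (4, 8):  # assumed precisions, shown side by side for comparison
        need = weights_gb(params_b, bits)
        verdict = "fits" if need < pool_gb else "does not fit"
        print(f"{label}: {params_b}B at {bits}-bit needs ~{need:.1f} GB "
              f"of {pool_gb} GB -> {verdict}")
```

Under these assumptions the 200B and 405B targets only fit at 4-bit precision, with roughly 28 GB and 53 GB left over for everything else. At 8 bits neither fits.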
For anyone who has ever tried to run a large LLM locally, that's enticing. But loading a model and running it fast are two very different things.
The bandwidth wall (the number nobody puts up front)
Here's THE number to remember, the one that determines everything: the GB10's LPDDR5x memory delivers 273 GB/s of bandwidth (256-bit interface, LPDDR5x at 8,533 MT/s). That sounds huge. In this context, it isn't.
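The 273 GB/s figure follows directly from the bus width and transfer rate quoted above. A minimal sketch of that arithmetic (the function name `peak_bandwidth_gbs` is hypothetical, and this is theoretical peak, not sustained bandwidth):

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfers_mt_s: float) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8          # 256 bits -> 32 bytes
    transfers_per_second = transfers_mt_s * 1e6      # MT/s -> transfers/s
    return bytes_per_transfer * transfers_per_second / 1e9


spark = peak_bandwidth_gbs(256, 8533)
print(f"GB10 LPDDR5x peak: {spark:.0f} GB/s")
```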
Compare:
- DGX Spark (unified LPDDR5x): 273 GB/s
- RTX 5090 (GDDR7): ~1,790 GB/s
- NVIDIA H100 (HBM3): 3,350 GB/s — i.e. 12x more
Why does this matter so much? Because token generation (decode) is limited by memory bandwidth, not by compute. To produce each token of a dense model, the GPU has to read all of the model's weights. A dense 49-billion-parameter model means ~25 GB read per token at 4-bit precision (about 49 GB in FP8). Generating just 10 tokens/s on a single sequence would therefore already consume over 90% of the 273 GB/s budget.
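The bandwidth limit can be turned into an upper bound on decode speed: tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below is a simplification that counts weights only (no KV cache, no overhead), and the function name `decode_ceiling_tok_s` is hypothetical. Note that 49 billion parameters weigh about 49 GB at one byte each and about 24.5 GB at 4 bits, so both cases are printed.

```python
BANDWIDTH_GBS = 273.0  # GB10 peak memory bandwidth, from the text above


def decode_ceiling_tok_s(params_billion: float, bytes_per_param: float,
                         bandwidth_gbs: float = BANDWIDTH_GBS) -> float:
    """Upper bound on single-stream decode speed for a dense model."""
    gb_per_token = params_billion * bytes_per_param  # all weights read once
    return bandwidth_gbs / gb_per_token


for label, bytes_per_param in [("FP8 (1 byte)", 1.0), ("4-bit (0.5 byte)", 0.5)]:
    ceiling = decode_ceiling_tok_s(49, bytes_per_param)
    print(f"49B dense, {label}: at most {ceiling:.1f} tok/s")
```

Real throughput lands below this ceiling, since sustained bandwidth is lower than peak and the KV cache adds to the bytes read per token.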
The result on real benchmarks (sources: LMSYS, Tom's Hardware, ProXPC):
- Llama 3.1 70B (FP8): ~2.7 tokens/s on decode. Usable for batch work or testing prompts, not for a real-time conversation.
- GPT-OSS 20B (MXFP4, Ollama): 49.7 tok/s on decode — versus 205 tok/s on an RTX 5090 (~4x faster).
- GPT-OSS 120B: ~11.7 tok/s on the Spark, versus ~60 tok/s on a Mac Studio M4 Max.
- Llama 3.1 8B: here the Spark does well — 20.5 tok/s at batch 1, climbing to 368 tok/s at batch 32.
The message is clear: on a large dense model in single-stream, the Spark is slow. The marketing talks about 1 petaFLOP; the reality of decode is that bandwidth is in charge.
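One way to check that bandwidth really is the limiting factor is to work backwards from a benchmark: multiply the measured tokens per second by the bytes read per token and compare the result with the 273 GB/s peak. This is a rough sketch that assumes exactly 70 billion parameters at one byte each for Llama 3.1 70B in FP8 and ignores the KV cache. The function name `implied_bandwidth_gbs` is hypothetical.

```python
PEAK_GBS = 273.0  # GB10 peak memory bandwidth


def implied_bandwidth_gbs(params_billion: float, bytes_per_param: float,
                          tok_s: float) -> float:
    """Bandwidth needed to stream all weights once per generated token."""
    return params_billion * bytes_per_param * tok_s


# Llama 3.1 70B in FP8 at ~2.7 tok/s, as quoted in the benchmark list
used = implied_bandwidth_gbs(70, 1.0, 2.7)
print(f"implied weight traffic: {used:.0f} GB/s "
      f"({used / PEAK_GBS:.0%} of peak)")
```

Under these assumptions the measurement already accounts for about 189 GB/s, roughly two thirds of theoretical peak, which leaves little room for a faster decode however much compute is available.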
Capacity vs throughput: understanding the real trade-off
Tom's Hardware sums it up perfectly with the image of a triangle: compute, bandwidth, memory capacity. The Spark sacrifices bandwidth to deliver decent compute and, above all, enormous capacity. It's a deliberate trade-off.
The Spark's value, then, is not serving a model fast. It's being able to load gigantic models that simply won't fit on a consumer GPU, accepting slow decode as the trade-off. It's a development sandbox, not a production inference server.
The angles that save the day
Fortunately, single-stream isn't the whole story. Three things rebalance the picture:
1. MoE (Mixture-of-Experts) models change the game
A dense model reads all of its weights for every token. A MoE model activates only a fraction of its parameters. GPT-OSS 120B, for instance, has 128 experts per layer of which only 4 are active per token: you read ~5 billion parameters instead of 120. At 4-bit precision that is ~2.5 GB per token, against the ~25 GB of the dense 49B example above. With the same bandwidth budget, you get 10x more tokens. On the Spark, favoring MoE architectures isn't a detail: it's the strategy.
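The same bytes-per-token arithmetic shows where the roughly 10x gain comes from. This is a minimal sketch assuming 4-bit weights (0.5 byte per parameter) and counting only the active parameters read per token. The names `gb_per_token` and `BYTES_PER_PARAM` are hypothetical.

```python
BANDWIDTH_GBS = 273.0   # GB10 peak memory bandwidth
BYTES_PER_PARAM = 0.5   # assumption: ~4-bit weights


def gb_per_token(active_params_billion: float,
                 bytes_per_param: float = BYTES_PER_PARAM) -> float:
    """GB of weights read to generate one token."""
    return active_params_billion * bytes_per_param


dense = gb_per_token(49)  # dense model: every parameter is active
moe = gb_per_token(5)     # MoE: only ~5B active parameters per token

for label, gb in [("dense 49B", dense), ("MoE, ~5B active", moe)]:
    print(f"{label}: {gb:.1f} GB/token, "
          f"ceiling {BANDWIDTH_GBS / gb:.0f} tok/s")
print(f"MoE reads {dense / moe:.1f}x fewer bytes per token")
```

Capacity still scales with total parameters: all 120 billion have to sit in memory even though only a fraction is read per token, which is exactly the profile that suits a large, slow memory pool.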
2. Concurrency (batching) reveals the hidden throughput
Judging the Spark on a single stream at 2.7 tok/s gives a misleading picture. Send it 256 concurrent streams and the same machine reaches ~695 tokens/s of aggregate throughput. The Spark isn't built to serve one user; it's built to process volume: synthetic dataset generation, model evaluation, batch pipelines.
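To see what batching buys, it helps to split aggregate throughput into a per-stream rate and a scaling efficiency. The sketch below only rearranges numbers quoted in this article. The text does not say which model the 695 tok/s figure was measured on, so it is simply divided by the stream count. Function names are hypothetical.

```python
def per_stream_tok_s(aggregate_tok_s: float, streams: int) -> float:
    """Average speed seen by each concurrent stream."""
    return aggregate_tok_s / streams


def scaling_efficiency(batch1_tok_s: float, aggregate_tok_s: float,
                       streams: int) -> float:
    """1.0 means every stream runs as fast as a lone stream would."""
    return aggregate_tok_s / (batch1_tok_s * streams)


# 256 concurrent streams at ~695 tok/s aggregate
print(f"256 streams: {per_stream_tok_s(695, 256):.2f} tok/s per stream")

# Llama 3.1 8B: 20.5 tok/s at batch 1, 368 tok/s at batch 32
speedup = 368 / 20.5
efficiency = scaling_efficiency(20.5, 368, 32)
print(f"Llama 3.1 8B: {speedup:.1f}x aggregate speedup at batch 32, "
      f"{per_stream_tok_s(368, 32):.1f} tok/s per stream, "
      f"{efficiency:.0%} scaling efficiency")
```

Each stream slows down as the batch grows, but total output rises sharply because the weights are read once per step and shared across all sequences in the batch.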
3. Prototyping, fine-tuning and CUDA compatibility
This is where the Spark really shines: a complete, well-supported CUDA environment where you can prototype locally with real large models before deploying to a cloud cluster. No API bill ballooning during your iterations, no data leaving your premises. On top of that, in a CES 2026 update NVIDIA claims up to 2.5x performance via TensorRT-LLM and speculative decoding.
Against the competition: what to choose?
At a comparable budget (~4,000 to 5,000 EUR), the Spark isn't alone:
- Mac Studio M3 Ultra (512 GB): bandwidth >800 GB/s (3x the Spark) and enough capacity for Llama 3.1 405B on a single machine. No hardware FP4, but often faster at token generation. The most serious competitor for pure local AI.
- 2x RTX 5090 rig: unbeatable bandwidth, but only 64 GB of VRAM and incomparable power draw/footprint. Excellent for models that fit in VRAM, useless beyond that.
- Cloud (H100/H200 on demand): unbeatable on raw throughput, but a recurring cost and data that leaves your infrastructure.
If capacity + the CUDA ecosystem + privacy come first: Spark. If generation speed comes first: Mac Studio Ultra or a discrete GPU. If you want to dig deeper, see our comparison Mac Studio M4 Max vs M3 Ultra for local AI and our pick of LLM models to run locally in 2026.
Price and availability
The Founders Edition launched at $3,999, but the surge in memory prices pushed it to ~$4,699 in early 2026. Beyond NVIDIA's own edition, several manufacturers offer their GB10 variants: ASUS Ascent GX10, Dell, HP, Lenovo, Acer Veriton AI and MSI EdgeXpert (configs up to 4 TB NVMe). Available at Micro Center, Newegg, Best Buy and the NVIDIA marketplace.
Verdict: who is it really for?
Buy it if you're an AI developer or researcher, you want to prototype locally with large models (ideally MoE), CUDA compatibility and privacy matter to you, and you work in batches rather than real-time conversation.
Walk away if you're after maximum generation speed for interactive use, or if your models already fit on a consumer GPU — in which case an RTX 5090 will be far faster for far less money.
The DGX Spark / RTX Spark is neither the scam denounced by those disappointed with single-stream, nor the magical supercomputer of the marketing. It's a niche tool, honest about what it does once you understand bandwidth: democratizing local access to very large models, at the cost of modest throughput. For the local-AI ecosystem — the one that wants to keep its data at home and stop paying for the API by the token — it's one more relevant piece in a puzzle that's growing fast.