Artificial Intelligence 18/06/2026 6 min read

NPU, TOPS and AI: what these neural chips really do (and why TOPS lie about LLMs)

AMD XDNA 2, Apple Neural Engine, Intel NPU 5: these chips advertise 38 to 55 TOPS, yet a 7B LLM generates text at only ~7 tok/s on them. A breakdown of what an NPU is actually good for, and what belongs on the GPU.

For two years now, every laptop processor has shipped with its marketing number: 50 TOPS from AMD, 38 TOPS from Apple, ~50 TOPS from Intel. You are sold an "AI PC", a "Copilot+ PC", a machine ready for local artificial intelligence. Then you actually run an LLM on it, and reality bites: the NPU is almost useless for generating text. This article explains what an NPU really does, why TOPS are a misleading metric for LLMs, and where the line falls between the NPU and the GPU.

What an NPU actually is

An NPU (Neural Processing Unit) is an accelerator dedicated to a single operation: low-precision matrix multiplication (INT8, INT4, sometimes BF16). It is a fixed-function block etched next to the CPU and GPU on the same die. Its reason to exist is not raw speed but power efficiency: AMD claims its XDNA 2 is up to 35x more efficient than the CPU on the same models, running at a fraction of a GPU's power draw.

The three major players in 2026:

AMD XDNA 2 (Strix Point, Strix Halo, Ryzen AI 400): up to 50 to 55 TOPS INT8, with hardware primitives for softmax, layer norm and KV-cache streaming.
Apple Neural Engine (M4): 38 TOPS, identical across M4, M4 Pro and M4 Max. Apple claims 60x the performance of the first ANE in the A11 Bionic (2017).
Intel NPU 5 (Panther Lake, Core Ultra Series 3, launched at CES 2026): ~48 to 50 TOPS. Mind the name: "NPU 5" means 5th generation, not 5 TOPS. Intel deliberately shrank the block rather than chasing higher TOPS.

TOPS, or the art of measuring a peak you never reach

One TOPS is one trillion operations per second. The trap: this number is a theoretical peak, measured under ideal conditions where the hardware is compute-saturated. LLM inference looks nothing like that.

A language model runs in two radically different phases:

Prefill (prompt processing): all input tokens are known in advance, so the math is a big matrix × matrix multiply. This is compute-bound: thousands of compute units light up, utilization hits 90-95%. Here, TOPS genuinely matter. This is the phase that drives time-to-first-token (TTFT).
Decode (token-by-token generation): the model produces one token at a time. Each token requires a matrix × vector multiply, meaning it reads the entire weight matrix just to compute against a single vector. The arithmetic finishes in a fraction of a millisecond, then the units sit idle waiting for the next weights to arrive from memory. This is memory-bound.
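To put a number on the difference between the two phases, here is a minimal sketch that counts operations per byte moved for a single square weight matrix. The function name, the 4096 width and the 2048-token prompt are illustrative assumptions, and the model deliberately ignores caches, attention and anything else that is not the weight multiply.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """Operations per byte moved for one (d_model x d_model) weight matrix.

    Assumes FP16 storage by default (2 bytes per value) and that weights,
    input activations and output activations each cross the memory bus once.
    """
    # One multiply and one add per weight, per token.
    ops = 2 * n_tokens * d_model * d_model
    weight_bytes = d_model * d_model * bytes_per_value
    # Input vector(s) read plus output vector(s) written.
    activation_bytes = 2 * n_tokens * d_model * bytes_per_value
    return ops / (weight_bytes + activation_bytes)


# Hypothetical sizes: a 2048-token prompt versus a single generated token.
prefill = arithmetic_intensity(n_tokens=2048, d_model=4096)
decode = arithmetic_intensity(n_tokens=1, d_model=4096)
print(f"prefill: {prefill:.0f} ops/byte, decode: {decode:.2f} ops/byte")
# prints "prefill: 1024 ops/byte, decode: 1.00 ops/byte"
```

Under these assumptions, prefill reuses each weight about a thousand times once it has been loaded, while decode uses it once and throws it away. That reuse is what keeps a compute array busy.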

This is the infamous "memory wall". For a 7B model in FP16, you must stream 14 GB of weights per token across the memory bus. The read takes 10 to 50x longer than the math itself. The result: during decode, the NPU's huge compute array is starved, spinning idle. TOPS buy you nothing.
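The 14 GB figure, and the throughput ceiling it implies, can be checked with a few lines of arithmetic. This is a rough sketch: the helper names are made up for illustration, 256 GB/s is just one example bandwidth, and the estimate ignores the KV cache and any overhead, so real numbers will land below it.

```python
def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def decode_ceiling_tok_s(params_billion: float, bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token streams all weights once."""
    return bandwidth_gb_s / weight_footprint_gb(params_billion, bytes_per_param)


# A 7B model at three precisions, on a hypothetical 256 GB/s memory bus.
for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    size = weight_footprint_gb(7, bytes_per_param)
    ceiling = decode_ceiling_tok_s(7, bytes_per_param, bandwidth_gb_s=256)
    print(f"7B {label}: {size:.1f} GB, at most {ceiling:.1f} tok/s at 256 GB/s")
# prints:
# 7B FP16: 14.0 GB, at most 18.3 tok/s at 256 GB/s
# 7B INT8: 7.0 GB, at most 36.6 tok/s at 256 GB/s
# 7B INT4: 3.5 GB, at most 73.1 tok/s at 256 GB/s
```

Notice that no TOPS figure appears anywhere in the calculation: the ceiling depends only on how many bytes you must read and how fast the bus delivers them.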

The most telling figure comes from a Hailo-10H NPU rated at 40 TOPS: in real LLM generation, researchers measured 6.9 tok/s, far below what the advertised peak suggests. On Apple Silicon the logic is the same: the ANE's 38 TOPS are identical across every M4, yet 70B generation speed varies 4 to 5x between a base M4 and an M4 Max, purely because of memory bandwidth (546 GB/s on the Max). I cover this in detail in VRAM, RAM and bandwidth: how to figure out if an LLM fits locally.
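One common way to reason about this is a roofline-style estimate: attainable throughput is the lower of the compute peak and what the memory bus can feed. The sketch below is an illustration, not a benchmark. It assumes decode performs about one operation per byte of FP16 weight (two operations on a 2-byte value), that prefill reaches on the order of a thousand, and it uses 120 GB/s for the base M4, which is my assumption from Apple's published specifications rather than a figure from this article.

```python
def attainable_tops(peak_tops: float, bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Roofline estimate: the lower of the compute roof and the memory roof."""
    # GB/s * ops/byte gives Gops/s; divide by 1000 to express it in TOPS.
    memory_roof_tops = bandwidth_gb_s * ops_per_byte / 1000
    return min(peak_tops, memory_roof_tops)


PEAK_TOPS = 38.0  # the same rated peak on every M4 variant

# Decode: roughly 1 op per byte, so the memory roof decides everything.
m4_base = attainable_tops(PEAK_TOPS, bandwidth_gb_s=120, ops_per_byte=1.0)  # assumed 120 GB/s
m4_max = attainable_tops(PEAK_TOPS, bandwidth_gb_s=546, ops_per_byte=1.0)
print(f"decode: {m4_base:.3f} vs {m4_max:.3f} TOPS, ratio {m4_max / m4_base:.2f}x")

# Prefill: with heavy weight reuse (hypothetical 1024 ops/byte) the compute roof takes over.
print(f"prefill on the base chip: {attainable_tops(PEAK_TOPS, 120, 1024):.1f} TOPS")
```

With identical peaks, the decode estimate differs by the bandwidth ratio alone, which falls in the 4 to 5x range quoted above. The rated TOPS only become the limit once the workload reuses data heavily, as prefill does.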

Why the GPU (almost) always wins for LLMs

If decode is memory-bound, the question is not "how many TOPS" but "how much bandwidth and how much memory". And here the integrated GPU crushes the NPU for three reasons:

The NPU is physically tiny. On an M2 Max, the ANE takes up less than a tenth of the GPU's area. The iPhone and the Mac use the same 16-core NPU design. It has barely grown across generations, while the GPU scales from 10 cores on the M4 to 40 on the M4 Max.
The GPU has access to all the bandwidth. This is exactly AMD's "Hybrid" strategy: the NPU handles prefill (compute-bound, low TTFT) and the iGPU takes decode (bandwidth-bound). On a Ryzen AI Max+ 395, the Radeon 8060S iGPU reaches up to 61 tok/s on client models.
The software ecosystem targets the GPU. llama.cpp, MLX and LM Studio offload to the GPU, not the NPU. NPUs often only accept INT8 and require specific quantization; they handle the dynamic shapes of LLMs poorly.

AMD even demonstrated a 4-bit-quantized Llama 3.1 70B running 2x faster on a Ryzen AI Max+ 395 than on an RTX 4090 24 GB. The credit does not go to the NPU but to unified memory (up to 128 GB), which can hold a model that cannot fit in 24 GB of VRAM. That is the whole point of the Strix Halo architecture, which I detail in AMD Strix Halo (Ryzen AI Max+ 395) for local AI. The same unified-memory principle explains why a Mac Studio M4 Max or M3 Ultra is so effective at LLMs, and why the NVIDIA DGX Spark bets everything on memory bandwidth rather than NPU TOPS.
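The capacity side of that comparison is simple arithmetic. Here is a minimal sketch with hypothetical helper names; it counts weights only, whereas real 4-bit formats store slightly more than 4 bits per weight and the KV cache needs room too, which is what the optional `overhead_gb` parameter stands in for.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight size in GB (1 GB = 1e9 bytes) for a given quantization width."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 0.0) -> bool:
    """True if the weights plus an assumed overhead fit in the given memory."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


size = weights_gb(70, 4)
print(f"70B at 4-bit: {size:.1f} GB of weights")           # prints "70B at 4-bit: 35.0 GB of weights"
print("fits in 24 GB of VRAM:", fits(70, 4, 24))            # prints "fits in 24 GB of VRAM: False"
print("fits in 128 GB of unified memory:", fits(70, 4, 128))  # prints "fits in 128 GB of unified memory: True"
```

Once a model spills out of VRAM, part of it has to be served from much slower system memory, which is why the larger pool can win even against a faster GPU.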

So what is the NPU actually good for?

The NPU is not useless, it is just mis-sold. It excels at continuous, parallelizable, compute-bound workloads, where it stays saturated while sipping power. Concretely:

Windows Studio Effects: background blur, eye-contact correction, automatic framing, noise suppression. Intel claims up to 9 hours of Teams video calls with effects on, precisely because the work runs on the NPU rather than the GPU.
Copilot+: Recall, Live Captions, Cocreator, super resolution. Microsoft mandates 40+ TOPS from the NPU alone (not the combined CPU+GPU+NPU figure) specifically so these always-on features run without draining the battery or saturating the GPU.
Real-time perception: offline transcription/translation (44+ languages), detection, background image generation.

The common thread: these are parallel streams that keep the compute units busy, where energy efficiency and privacy (everything stays local) matter more than raw speed. That is exactly the territory where a GPU would be a waste of power.

Verdict: don't pay for TOPS you will never use

If your goal is running local LLMs, the NPU is almost beside the point. Look instead at:

Memory bandwidth (256, 400, 546 GB/s...): this is what dictates your throughput in tok/s.
Unified memory size: 32, 64, 128 GB decides which models you can hold.
GPU/iGPU power and the maturity of its stack (ROCm, Metal/MLX, OpenVINO).

The NPU, meanwhile, earns its keep for battery life: clean video calls all day, live captions, Copilot+ with no heat or fan spin. It is an excellent perception coprocessor, not a text-generation engine. To pick your models once you have the right hardware, see the best local LLM models of 2026 on Ollama. Next time a vendor waves a TOPS figure to sell you "AI", you will know exactly what it is worth, and what it is not.

Further reading: Strix Halo / Ryzen AI Max+ 395 · Sizing VRAM/RAM for a local LLM · Mac Studio M4 Max vs M3 Ultra · NVIDIA DGX Spark · Local LLM models 2026

Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.
