Apple confirms up to 4x MLX AI performance boost with macOS 26.2

Those time-to-first-token results: shorter bars are good. Very good.
Did you miss the latest fascinating post from Apple’s Machine Learning Research team?
It should be valuable reading for anyone thinking about using Macs to deliver on-prem AI, and it shows the massive performance boost you can expect when running LLMs on-device on an M5 Mac.
The Mac is already a tool of choice for AI developers and researchers, in part thanks to MLX, Apple’s open-source framework that lets people explore and run LLMs on Macs. MLX works with all Apple Silicon systems, including the all-new M5 MacBook Pro.
Coming to a Mac near you
Apple is working on macOS 26.2, which will deliver a real performance boost when running MLX on an M5 Mac, since MLX will then support the chip’s Neural Accelerators. The company’s blog shows the impact of that support is significant.
The researchers tested several LLMs to get some sense of the speed increase realized as a result of the move. These included Qwen 1.7B and 8B, in native BF16 precision, and 4-bit quantized Qwen 8B and Qwen 14B models. They also tested two Mixture of Experts (MoE) models: Qwen 30B (3B active parameters, 4-bit quantized) and GPT OSS 20B (in native MXFP4 precision).
The tests showed that, compared to M4 Macs, the M5 delivered up to 4.1x the performance in terms of time to first token. For example, when running Qwen3-14B-MLX 4-bit, that time shrank from about 36 seconds to around eight seconds, a massive improvement.
Under 3 seconds for some tasks
“The M5 pushes the time-to-first-token generation under 10 seconds for a dense 14B architecture, and under 3 seconds for a 30B MoE, delivering strong performance for these architectures on a MacBook Pro,” they said.
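Time to first token (TTFT) is essentially prefill latency: how long you wait between submitting a prompt and seeing the first generated token. A minimal, framework-agnostic sketch of how you might measure it against any streaming generator (the `fake_stream` stub below is a stand-in for illustration, not MLX code):

```python
import time
from typing import Iterable, Iterator

def measure_ttft(token_stream: Iterable[str]) -> tuple[float, list[str]]:
    """Return (seconds until first token, all tokens) for a streaming generator."""
    start = time.perf_counter()
    ttft = 0.0
    tokens: list[str] = []
    for tok in token_stream:
        if not tokens:
            # First token has arrived: record prefill latency.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    return ttft, tokens

# Stand-in generator simulating a model that "prefills" before streaming.
def fake_stream(prefill_s: float = 0.05) -> Iterator[str]:
    time.sleep(prefill_s)  # simulated prompt processing
    yield from ["Hello", ",", " world"]

ttft, toks = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms over {len(toks)} tokens")
```

The same pattern applies to a real MLX token stream: start the clock before the call, stop it when the first token arrives.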
“Generating subsequent tokens is bounded by memory bandwidth, rather than by compute ability. On the architectures we tested in this post, the M5 provides 19-27% performance boost compared to the M4, thanks to its greater memory bandwidth,” they also added. Similarly, generating a 1024×1024 image with FLUX-dev-4bit (12B parameters) with MLX is more than 3.8x faster on an M5 than on an M4.
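The bandwidth-bound claim is easy to sanity-check with back-of-the-envelope arithmetic: if every new token requires streaming the full set of model weights from unified memory, peak decode speed is roughly bandwidth divided by model size in bytes. A sketch, assuming Apple’s published unified memory bandwidths (120 GB/s for M4, 153 GB/s for M5):

```python
def decode_tokens_per_sec(bandwidth_gb_s: float,
                          params_billions: float,
                          bytes_per_param: float) -> float:
    """Upper-bound tokens/sec when decoding is memory-bandwidth bound."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Qwen 8B in BF16 (2 bytes per parameter) on M4 vs. M5:
m4 = decode_tokens_per_sec(120, 8, 2)  # 7.5 tokens/sec ceiling
m5 = decode_tokens_per_sec(153, 8, 2)  # ~9.6 tokens/sec ceiling
print(f"M4: {m4:.1f} tok/s, M5: {m5:.1f} tok/s, uplift ~{(m5 / m4 - 1) * 100:.0f}%")
```

The ~28% theoretical ceiling from the bandwidth ratio alone lines up well with the 19-27% real-world gain the researchers report.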
This is all highly significant. Taken together with the move to let users daisy-chain Macs over Thunderbolt 5 cables to create ad hoc Mac clusters, it really reinforces the argument that Macs are among the best tools for building and running AI.