- The Flash-MoE engine enables running 400B-parameter AI models on an iPhone 17 Pro, defying traditional memory constraints.
- Initial speed is just 0.6 tokens per second, but optimizations nearly double it to 1.1 tokens/sec with minimal quality loss.
- This breakthrough builds on Apple's 2023 research and automated AI research methods, showcasing collaboration between academia and the developer community.
- It could spur private, offline AI, reducing cloud reliance and validating hardware architectures like unified memory.
The iPhone 17 Pro's 12 GB of unified memory seemed an insurmountable barrier for massive artificial intelligence models. Conventionally, running a system with hundreds of billions of parameters locally demanded tens of gigabytes of RAM and specialized hardware. Yet a software engineering breakthrough has shown that what once seemed impossible is now technically feasible, albeit at speeds that test one's patience.
This breakthrough redefines what's possible in on-device AI: it could democratize access to advanced models, strengthen privacy, and reshape priorities for developers, hardware makers, and end users alike.
The Engine That Made It Happen
Developer Daniel Woods, known as @dandeveloper, created an open-source inference engine called Flash-MoE. Published on GitHub alongside a detailed study, this system leverages an optimized Mixture of Experts (MoE) architecture. Initially, Woods ran the full, uncompressed Qwen 3.5 397B model on a MacBook Pro with 48 GB of RAM. The model, occupying 209 GB on disk, worked, setting a critical precedent.
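The key property of a Mixture of Experts architecture is that each token activates only a small subset of the model's "expert" blocks, so only a fraction of the weights needs to be resident in memory at any moment. The source doesn't describe Flash-MoE's internals, but the general routing idea can be sketched as follows (toy dimensions and a simple softmax gate, purely for illustration):

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy Mixture-of-Experts layer: route the input to only the
    top_k highest-scoring experts instead of running all of them."""
    scores = x @ gate_w                      # one gating score per expert
    top = np.argsort(scores)[-top_k:]        # indices of the chosen experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # softmax over the selected experts
    # Only the selected experts' weights ever need to be in RAM.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

y = moe_forward(x, gate_w, experts, top_k=2)
print(y.shape)  # (8,)
```

With 16 experts and top_k=2, only an eighth of the expert weights are touched per token, which is what makes aggressive memory savings possible in the first place.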
The developer community quickly pushed boundaries further. Others managed to run even larger models, such as DeepSeek-V3 with 671 billion parameters and Kimi K2.5 with a staggering one trillion parameters, on similar MacBook hardware. Inference speeds were notably slow, but the mere fact they functioned marked a milestone in decentralized AI computing.
An iPhone with 12 GB of RAM runs a 400B model, redefining the boundaries of on-device AI.
The iPhone Test
Inspired by these achievements, another developer under the alias Anemll took the experiment to the extreme: attempting to run the Qwen 3.5 397B model on an iPhone 17 Pro with its 12 GB of unified memory. Against all odds, the model executed, producing responses at a mere 0.6 tokens per second. This rate is nearly unusable for practical applications, but the technical demonstration is profound.
Subsequently, Anemll optimized the approach by reducing the number of "experts" activated in the MoE architecture to four, nearly doubling the speed to 1.1 tokens per second with an estimated 2.5% loss in response quality. Meanwhile, another user ran a smaller model, Qwen 3.5 35B, on the same iPhone, achieving a much more usable speed of 13.1 tokens per second. These experiments showcase a spectrum of trade-offs between model size, speed, and quality.
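The trade-off Anemll exploited comes down to simple arithmetic: the parameters a MoE model actually touches per token scale with the number of active experts, not with the total parameter count. A minimal sketch with illustrative numbers (these are not Qwen 3.5's real expert counts or layer sizes):

```python
def active_params(total_expert_params, n_experts, k, shared_params=0):
    """Parameters actually touched per token in a MoE model:
    shared layers plus k of n_experts expert blocks."""
    return shared_params + total_expert_params * k / n_experts

# Illustrative numbers only -- not Qwen 3.5's real configuration.
total = 390e9   # parameters living in expert blocks
shared = 7e9    # attention/embedding layers every token uses
for k in (8, 4):
    per_token = active_params(total, 64, k, shared)
    print(f"k={k}: ~{per_token / 1e9:.0f}B parameters active per token")
```

Halving the active experts roughly halves the per-token memory traffic, which is why the speedup is large while the quality loss stays small: the gate was already concentrating most of the weight on a few experts.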
Historical Context and Methodology
This advancement doesn't come out of thin air. Three years ago, Apple researchers published a study titled "LLM in a flash," proposing to use not just the unified memory of Apple devices but also their internal storage to run large AI models. The idea was to circumvent RAM limitations through efficient memory swapping techniques.
Woods applied this methodology using advanced tools like Claude Code with the Claude Opus 4.6 model and adopted the "autoresearch" approach popularized by Andrej Karpathy. This automated AI research method helped implement Flash-MoE, demonstrating how collaboration between academic research and community development can yield technological leaps.
Implications for the Future of AI
The ability to run gigantic models on modest hardware has significant ramifications. First, it challenges the narrative that advanced AI is permanently tethered to the cloud and massive data centers. The teams behind open models like GLM, and other players in the open-source AI space, could see accelerated adoption if hardware barriers diminish.
“Markets are always looking at the future, not the present.”
— Xataka
Second, this could spur a new wave of truly private, offline AI applications, attracting users concerned about data privacy. Finally, for the hardware industry, and especially Apple, it validates the unified memory architecture and could influence future design decisions. Current speeds, however, remain a critical bottleneck that will demand continued innovation in both software and silicon.