- The Flash-MoE engine enables running 400B-parameter AI models on an iPhone 17 Pro, defying traditional memory constraints.
- Initial speed is just 0.6 tokens per second, but optimizations nearly double it to 1.1 tokens/sec with minimal quality loss.
- This breakthrough builds on Apple's 2023 research and automated AI research methods, showcasing collaboration between academia and the developer community.
- It could spur private, offline AI, reducing cloud reliance and validating hardware architectures like unified memory.
The iPhone 17 Pro's 12 GB of unified memory seemed an insurmountable barrier for massive artificial intelligence models. Conventionally, running a system with hundreds of billions of parameters locally demanded tens of gigabytes of RAM and specialized hardware. Yet a software engineering breakthrough has shown that what once seemed impossible is now technically feasible, albeit at speeds that test one's patience.
This breakthrough redefines what's possible in on-device AI: it could democratize access to advanced models, strengthen privacy, and reshape priorities for developers, hardware makers, and end users alike.
The Engine That Made It Happen
Developer Daniel Woods, known as @dandeveloper, created an open-source inference engine called Flash-MoE. Published on GitHub alongside a detailed study, this system leverages an optimized Mixture of Experts (MoE) architecture. Initially, Woods ran the full, uncompressed Qwen 3.5 397B model on a MacBook Pro with 48 GB of RAM. The model, occupying 209 GB on disk, worked, setting a critical precedent.
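The key property of a Mixture of Experts architecture is that each token activates only a small subset of the model's "expert" blocks, so only a fraction of the weights needs to be resident in memory at any moment. The source doesn't describe Flash-MoE's internals, but the general routing idea can be sketched as follows (toy dimensions and a simple softmax gate, purely for illustration):

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy Mixture-of-Experts layer: route the input to only the
    top_k highest-scoring experts instead of running all of them."""
    scores = x @ gate_w                      # one gating score per expert
    top = np.argsort(scores)[-top_k:]        # indices of the chosen experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # softmax over the selected experts
    # Only the selected experts' weights ever need to be in RAM.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

y = moe_forward(x, gate_w, experts, top_k=2)
print(y.shape)  # (8,)
```

With 16 experts and top_k=2, only an eighth of the expert weights are touched per token, which is what makes aggressive memory savings possible in the first place.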
The developer community quickly pushed boundaries further. Others managed to run even larger models, such as DeepSeek-V3 with 671 billion parameters and Kimi K2.5 with a staggering one trillion parameters, on similar MacBook hardware. Inference speeds were notably slow, but the mere fact they functioned marked a milestone in decentralized AI computing.
An iPhone with 12 GB of RAM runs a 400B model, redefining the boundaries of on-device AI.
The iPhone Test
Inspired by these achievements, another developer under the alias Anemll took the experiment to the extreme: attempting to run the Qwen 3.5 397B model on an iPhone 17 Pro with its 12 GB of unified memory. Against all odds, the model executed, producing responses at a mere 0.6 tokens per second. This rate is nearly unusable for practical applications, but the technical demonstration is profound.
Subsequently, Anemll optimized the approach by reducing the number of "experts" activated in the MoE architecture to four, nearly doubling the speed to 1.1 tokens per second with an estimated 2.5% loss in response quality. Meanwhile, another user ran a smaller model, Qwen 3.5 35B, on the same iPhone, achieving a much more usable speed of 13.1 tokens per second. These experiments showcase a spectrum of trade-offs between model size, speed, and quality.
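The trade-off Anemll exploited comes down to simple arithmetic: the parameters a MoE model actually touches per token scale with the number of active experts, not with the total parameter count. A minimal sketch with illustrative numbers (these are not Qwen 3.5's real expert counts or layer sizes):

```python
def active_params(total_expert_params, n_experts, k, shared_params=0):
    """Parameters actually touched per token in a MoE model:
    shared layers plus k of n_experts expert blocks."""
    return shared_params + total_expert_params * k / n_experts

# Illustrative numbers only -- not Qwen 3.5's real configuration.
total = 390e9   # parameters living in expert blocks
shared = 7e9    # attention/embedding layers every token uses
for k in (8, 4):
    per_token = active_params(total, 64, k, shared)
    print(f"k={k}: ~{per_token / 1e9:.0f}B parameters active per token")
```

Halving the active experts roughly halves the per-token memory traffic, which is why the speedup is large while the quality loss stays small: the gate was already concentrating most of the weight on a few experts.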
Historical Context and Methodology
This advancement doesn't come out of thin air. Three years ago, Apple researchers published a study titled "LLM in a flash," proposing to use not just the unified memory of Apple devices but also their internal storage to run large AI models. The idea was to circumvent RAM limitations through efficient memory swapping techniques.
Woods applied this methodology using advanced tools like Claude Code with the Claude Opus 4.6 model and adopted the "autoresearch" approach popularized by Andrej Karpathy. This automated AI research method helped implement Flash-MoE, demonstrating how collaboration between academic research and community development can yield technological leaps.
Implications for the Future of AI
The ability to run gigantic models on modest hardware has significant ramifications. First, it challenges the narrative that advanced AI is permanently tethered to the cloud and massive data centers. The teams behind open models like GLM, and other players in the open-source AI space, could see accelerated adoption if hardware barriers diminish.
“Markets are always looking at the future, not the present.”
— Xataka
Second, this could spur a new wave of truly private, offline AI applications, attracting users concerned about data privacy. Finally, for the hardware industry, and especially Apple, it validates the unified memory architecture and could influence future design decisions. Current speeds, however, remain a critical bottleneck that will demand continued innovation in both software and silicon.