Google Boosts Gemma 4 Inference Speeds Up to 3x With Speculative Decoding Upgrade
Google has introduced a multi-token prediction (MTP) drafter for its open Gemma 4 models, using speculative decoding to deliver up to three times faster inference without sacrificing output quality, marking a significant step toward practical offline large language models.

Google has rolled out a major performance upgrade for its Gemma 4 open-weight models, introducing a multi-token prediction (MTP) drafter designed to dramatically accelerate inference. By leveraging a speculative decoding architecture, the company says it can increase generation speeds by up to three times while maintaining output quality and logical coherence.
The update comes just weeks after Gemma 4 gained traction in the open model community, with downloads reportedly surpassing 60 million. The new release focuses squarely on one of the most persistent bottlenecks in large language model deployment: inference latency.
How Speculative Decoding Works
Traditional transformer-based models are often limited not by raw compute but by memory bandwidth. During autoregressive inference, the model's billions of parameters must be streamed from memory to the processing units for every generated token. That data movement is far slower than the arithmetic itself, leaving hardware underutilized and adding latency to each response.
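A rough back-of-envelope calculation makes the bottleneck concrete. The numbers below are illustrative assumptions, not figures from Google's announcement, but they show why generating one token at a time caps throughput at whatever rate the weights can be read from memory.

```python
# Back-of-envelope estimate with assumed (not Gemma-specific) numbers:
# a 26B-parameter model in 8-bit weights must stream ~26 GB per decoded
# token when generating a single sequence one token at a time.
params = 26e9            # parameter count (illustrative)
bytes_per_param = 1      # 8-bit quantized weights (assumption)
bandwidth = 400e9        # memory bandwidth in bytes/s (illustrative)

weight_bytes = params * bytes_per_param
tokens_per_second = bandwidth / weight_bytes
print(f"~{tokens_per_second:.0f} tokens/s upper bound from bandwidth alone")
# -> roughly 15 tokens/s, no matter how much raw compute sits idle
```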
Google’s approach pairs a large “target” model, such as Gemma 4 31B, with a lightweight MTP drafter. The drafter uses otherwise idle compute capacity to propose several future tokens ahead of time, and the larger model then verifies all of them in a single parallel pass. Wherever the drafted tokens agree with what the target model would have generated on its own, entire runs of tokens are accepted at once, cutting the number of expensive target-model passes per generated token.
This cooperative “draft-and-verify” setup enables higher throughput without compromising model fidelity, making speculative decoding increasingly attractive for real-world deployments.
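For readers who want to see the control flow, the loop below is a minimal, greedy-decoding sketch of the draft-and-verify idea in Python. The draft_model.greedy_next and target_model.greedy_all calls are hypothetical placeholders for real forward passes, not part of any Gemma 4 API; the point is that the expensive target model runs once per batch of drafted tokens rather than once per token.

```python
def speculative_decode(target_model, draft_model, tokens, k=4, max_new=256):
    """Greedy draft-and-verify loop; returns the extended token list.

    `draft_model` and `target_model` are assumed to expose:
      - draft_model.greedy_next(seq) -> next token (one cheap forward pass)
      - target_model.greedy_all(prefix, drafted) -> list of k+1 tokens, the
        target's own choice after the prefix and after each drafted token,
        computed in a single parallel forward pass.
    Both interfaces are hypothetical, used here only for illustration.
    """
    generated = 0
    while generated < max_new:
        # 1. Draft: the small model proposes k tokens, one at a time.
        proposed = []
        for _ in range(k):
            proposed.append(draft_model.greedy_next(tokens + proposed))

        # 2. Verify: one target-model pass scores the prefix plus all k
        #    drafted tokens in parallel.
        target_choice = target_model.greedy_all(tokens, proposed)  # k+1 tokens

        # 3. Accept the longest prefix on which draft and target agree.
        n_accept = 0
        while n_accept < k and proposed[n_accept] == target_choice[n_accept]:
            n_accept += 1

        # 4. Emit the accepted run plus one token chosen by the target itself,
        #    so the output matches plain target-only decoding exactly.
        tokens = tokens + proposed[:n_accept] + [target_choice[n_accept]]
        generated += n_accept + 1
    return tokens
```

Because the target model's own choice is always emitted at the first disagreement, the generated text is identical to what the target model alone would produce; the drafter only changes how many tokens each expensive forward pass can confirm.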
Strong Gains on Local Hardware
According to Google’s benchmarks, the performance gains are particularly noticeable on local devices. On Apple Silicon systems, the Gemma 4 26B model achieved roughly 2.2× faster inference at batch sizes between four and eight. Similar benefits are expected on consumer-grade GPUs.
These improvements could make it more practical to run advanced coding assistants, chatbots, and agent-based workflows directly on personal machines. Faster inference also reduces energy consumption per task, an important factor for edge devices and mobile AI applications.
Implications for Low-Latency AI Applications
The update is especially relevant for latency-sensitive use cases such as real-time chat systems, automated programming tools, and autonomous agents. By narrowing the trade-off between speed and accuracy, Google is positioning Gemma 4 as a more viable option for offline and hybrid deployments.
As inference efficiency improves and hardware requirements become less restrictive, open-weight models like Gemma 4 may accelerate the shift from cloud-dependent AI toward capable on-device systems. While cloud infrastructure will remain central for large-scale workloads, advances in speculative decoding suggest that the era of practical offline large language models is moving closer to reality.