Getting Started

This site is built around one rule: start with the setup order that Gemma 4 community threads keep reporting as working, not with benchmark charts.

The community-tested install order

Start with the easiest runtime for your device instead of the most configurable one.
Start with the smallest model that has a real chance of fitting comfortably.
Keep the first run short. One clean reply matters more than a huge context test.
Confirm that the device is actually using the accelerator path you expect.
Only after short chat works should you test long context, vision, or tool calling.
If agentic use matters, update the runtime and the chat template before you decide the model is bad.
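The short first run and the accelerator check can be sketched in a few lines. This sketch assumes you picked Ollama as the runtime and that it is serving on its default local address. The model tag `gemma4:e2b` is a placeholder, not a confirmed tag, so substitute whatever `ollama list` shows on your machine. The `size` and `size_vram` fields follow Ollama's documented running-models response; verify them against your installed version before trusting the number.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address
MODEL_TAG = "gemma4:e2b"  # placeholder tag (assumption); check `ollama list`


def build_smoke_request(model: str, max_tokens: int = 64) -> dict:
    """Build a deliberately small request: one short prompt, capped output."""
    return {
        "model": model,
        "prompt": "Reply with one short sentence: what is a local model?",
        "stream": False,
        "options": {"num_predict": max_tokens},
    }


def vram_fraction(ps_response: dict, model: str) -> Optional[float]:
    """Fraction of a loaded model's bytes reported as held in accelerator memory.

    Returns None if the model is not listed as running or reports no size.
    """
    for entry in ps_response.get("models", []):
        if entry.get("name") == model:
            size = entry.get("size", 0)
            return entry.get("size_vram", 0) / size if size else None
    return None


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the local server and parse the JSON reply."""
    req = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def get_json(path: str) -> dict:
    """GET a JSON document from the local server."""
    with urllib.request.urlopen(OLLAMA_URL + path, timeout=30) as resp:
        return json.load(resp)


def main() -> None:
    """Run one short chat, then report where the model is loaded."""
    reply = post_json("/api/generate", build_smoke_request(MODEL_TAG))
    print(reply.get("response", "").strip())
    frac = vram_fraction(get_json("/api/ps"), MODEL_TAG)
    if frac is None:
        print("model is not listed as loaded")
    else:
        print(f"{frac:.0%} of the model is in accelerator memory")
```

Call `main()` with the server running. A fraction well below 100% suggests the runtime has split the model between accelerator and CPU, which is worth fixing before you test anything heavier.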

Before you download anything

Answer these four questions first:

Is your target device a phone, tablet, laptop, or desktop?
How much RAM, VRAM, or unified memory does it actually have?
Do you want the easiest UI, or the most control?
Are you optimizing for speed, quality, or just proving that it runs?

Fastest paths by device

Android: start with AI Edge Gallery and test E2B before E4B.
iPhone and iPad: start with the iPhone and iPad guide and keep expectations modest.
Mac: start with the Mac guide and use LM Studio before deeper runtime tuning.
Windows: start with the Windows guide and check VRAM before anything else.

Fastest paths by runtime preference

I want the easiest desktop app: LM Studio
I want the most control: llama.cpp
I want a local API: Ollama
I want a phone-first experience: AI Edge Gallery

What Reddit users keep getting wrong

They download 31B or an aggressive desktop quant before checking whether the machine can even carry it.
They assume a phone GPU or NPU is active when the app has silently fallen back to CPU.
They judge Gemma 4 through an old desktop backend, stale template, or older quant build.
They jump into long context, agent tools, or vision before proving that short chat is stable.
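The first mistake is cheap to avoid with some arithmetic before you download. The sketch below estimates the size of the weights alone from parameter count and bits per weight, then checks that against available memory with some headroom. The parameter counts are read off the model names (26B, 31B), and the 2 GiB default headroom for context and runtime overhead is an illustrative guess, not a measured figure. Real quant files usually store somewhat more than their nominal bit count, so treat the result as a lower bound.

```python
GIB = 1024 ** 3


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def fits(
    params_billion: float,
    bits_per_weight: float,
    memory_gib: float,
    headroom_gib: float = 2.0,  # assumed allowance for context and runtime overhead
) -> bool:
    """True if weights plus headroom fit in the given memory budget."""
    return weight_gib(params_billion, bits_per_weight) + headroom_gib <= memory_gib


def report(memory_gib: float) -> None:
    """Print a quick fit table for the two larger model sizes."""
    for name, params in (("26B A4B", 26.0), ("31B", 31.0)):
        for bits in (4.0, 8.0):
            size = weight_gib(params, bits)
            verdict = "fits" if fits(params, bits, memory_gib) else "does not fit"
            print(f"{name} at {bits:g} bits: ~{size:.1f} GiB weights, {verdict}")
```

Calling `report(16)` prints the table for a 16GB card. Raise `headroom_gib` if you plan to use long context or vision, since both add memory on top of the weights.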

Three setups worth copying

Phone daily driver: AI Edge Gallery plus E2B is still the safest way to prove that a phone can run Gemma 4. It often feels better than forcing E4B on a phone that is not accelerating cleanly.
16GB desktop GPU: 26B A4B can work, but only if you treat quant choice, context length, and vision overhead as first-class setup decisions.
Repurposed mobile hardware: a stripped Android phone can become a small local inference node, but the community keeps pushing these builds toward llama.cpp once performance starts to matter.

What “offline” actually means here

Running Gemma 4 locally means inference is happening on your device. It does not mean the model has live web access. The official Gemma 4 model card lists the training cutoff as January 2025.

Next: pick a model size. Use E2B, E4B, 26B A4B, or 31B based on hardware and intent.

Read the Reddit-driven setup guide. See the Gemma 4 failures that keep repeating and the local setups people actually kept using.