
How to Run Gemma 4 Locally: Ollama, LM Studio, and Unsloth Complete Guide

12 min read
#Gemma 4#Local Deployment#Ollama#LM Studio#Unsloth#vLLM#Quantization#Offline AI#Privacy#Developer Tutorial


Terminal screen showing the process of downloading and running a Gemma 4 model locally

TL;DR: Ollama gets you running in five minutes with a single command — ideal for developers. LM Studio offers a full GUI — perfect for non-technical users. Unsloth uses ~40% less memory and connects seamlessly to fine-tuning workflows. All three support every Gemma 4 model size. Pick based on your use case.

"Just install Ollama, right?" That is the most common answer I hear. But Ollama is only one of three mainstream options, and it may not be the best fit for you.

Over the past week I deployed Gemma 4 E4B and 26B using all three methods, hit several undocumented gotchas, and discovered some nuances the official docs do not mention. This article walks through the full process, trade-offs, and common issues for each approach.

Not sure which deployment path fits your team? Book a free architecture consultation and let our consultants recommend the best approach based on your hardware and requirements.

Want to understand Gemma 4's full capabilities and model family first? See the Gemma 4 Complete Guide.


Why Run Gemma 4 Locally? Three Reasons

Before you start, ask yourself honestly: do you actually need local deployment?

If you only run a few queries a day, Google AI Studio or the Vertex AI API is probably easier. But if any of the following scenarios applies, local deployment is the right call.

Reason 1: Data Privacy and Compliance

Healthcare, finance, legal — industries where data cannot leave your network. Local deployment means your prompts and responses never touch a third-party server. GDPR, HIPAA, and most regional data-protection regulations are satisfied automatically.

Reason 2: Offline Availability

On a plane, at a remote construction site, in a factory with unreliable connectivity. Local deployment lets you use AI with zero internet. One of my clients runs Gemma 4 E4B on an offshore wind-turbine platform for equipment inspection reports — completely offline.

Reason 3: Zero API Cost

API pricing is per-token, and costs scale fast. Local deployment has near-zero marginal cost — the hardware is a one-time investment and electricity is negligible. If you process hundreds of thousands of tokens per day, local deployment pays for itself within three months.
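To make the break-even claim concrete, here is a back-of-envelope sketch. The hardware cost, daily token volume, and per-million-token API price below are illustrative assumptions, not published rates — plug in your own numbers:

```python
# Back-of-envelope break-even estimate for local vs. API inference.
# All prices below are illustrative assumptions, not published rates.

def breakeven_days(hardware_cost: float, tokens_per_day: int,
                   api_price_per_m_tokens: float) -> float:
    """Days until a one-time hardware purchase beats per-token API billing."""
    daily_api_cost = tokens_per_day / 1_000_000 * api_price_per_m_tokens
    return hardware_cost / daily_api_cost

# Example: assumed $900 GPU, 2M tokens/day, $5 per 1M API tokens
print(f"{breakeven_days(900, 2_000_000, 5.0):.0f} days")  # prints "90 days"
```

How fast local deployment pays off depends entirely on your volume and the API pricing you are replacing; at low volumes, the hardware may never pay for itself.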


Before You Start: Confirm Your Hardware Can Handle It

Before installing anything, check your hardware. Choosing the wrong model size means either painfully slow inference or an outright OOM crash.

Quick reference:

| Model | 4-bit VRAM | Recommended Minimum Hardware |
|---|---|---|
| E2B (2.3B) | ~1.5 GB | Phone, Raspberry Pi |
| E4B (4.3B) | ~3 GB | Laptop with 8 GB RAM |
| 26B MoE (A4B) | ~16 GB | RTX 4060 Ti 16 GB / M3 24 GB |
| 31B Dense | ~18 GB | RTX 4090 24 GB / M4 Pro 48 GB |

Need a more detailed breakdown? See the Gemma 4 Hardware Requirements Guide for per-GPU and Apple Silicon benchmarks.
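As a rough cross-check of the table above, 4-bit VRAM is approximately the weight size (params × bits ÷ 8 bytes) plus runtime overhead for the KV cache and activations. The 1.2× overhead factor and 0.5 GB fixed cost below are my own assumptions for short contexts, not measured figures:

```python
# Rough VRAM rule of thumb: weights = params * bits / 8 (in GB),
# plus runtime overhead. The 1.2x factor and 0.5 GB fixed cost are
# assumptions for short-context inference, not measured values.

def estimate_vram_gb(params_billions: float, bits: int = 4,
                     overhead_factor: float = 1.2,
                     fixed_gb: float = 0.5) -> float:
    weights_gb = params_billions * bits / 8
    return weights_gb * overhead_factor + fixed_gb

for name, params in [("E2B", 2.3), ("E4B", 4.3), ("31B Dense", 31.0)]:
    print(f"{name}: ~{estimate_vram_gb(params):.1f} GB at 4-bit")
```

Long contexts grow the KV cache well beyond this estimate, so treat these numbers as a floor, not a budget.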


Method 1: Ollama — Up and Running in Five Minutes

Three deployment paths compared — CLI, GUI, and API

Ollama is the simplest local LLM tool available today. One command to install, one to download a model, one to start chatting. If you are a developer, this is where I recommend starting.

Step 1: Install Ollama

macOS:

# Option A: Homebrew
brew install ollama

# Option B: download the macOS app installer from ollama.com

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the .exe installer from ollama.com and run it.

After installation, Ollama automatically starts a background service listening on localhost:11434.

Step 2: Download a Gemma 4 Model

# Recommended starting point: E4B (~3 GB)
ollama pull gemma4:e4b

# Stronger reasoning: 26B MoE (~16 GB)
ollama pull gemma4:26b

# Lightest option: E2B (~1.5 GB)
ollama pull gemma4:e2b

Download time depends on your connection. E4B is roughly 3 GB — about four minutes on a 100 Mbps link. The 26B Q4 variant weighs in at around 16 GB.
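The four-minute figure is simple arithmetic — size in gigabytes times 8 gigabits per gigabyte, divided by link speed — ignoring protocol overhead:

```python
# Sanity-check the download estimates: seconds = GB * 8000 Mb / link_Mbps.

def download_minutes(size_gb: float, link_mbps: float) -> float:
    return size_gb * 8 * 1000 / link_mbps / 60

print(f"E4B (~3 GB) at 100 Mbps: {download_minutes(3, 100):.0f} min")
print(f"26B Q4 (~16 GB) at 100 Mbps: {download_minutes(16, 100):.0f} min")
```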

Step 3: Start Chatting

# Interactive chat
ollama run gemma4:e4b

# With verbose output (shows timing and token statistics)
ollama run gemma4:e4b --verbose

You will see an interactive prompt. Type a question and the model answers. Type /bye or press Ctrl+D to exit.

Step 4: Connect via API

Ollama ships with an OpenAI-compatible API server, so your existing code barely needs to change:

# Test the API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:e4b",
    "messages": [{"role": "user", "content": "Explain quantum computing in three sentences."}]
  }'

Python example (using the openai package):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="gemma4:e4b",
    messages=[{"role": "user", "content": "Explain quantum computing in three sentences."}]
)
print(response.choices[0].message.content)

Choosing a Quantization Level

Ollama defaults to Q4_K_M, which is good enough for most tasks. If you need higher quality:

# Specify quantization
ollama pull gemma4:26b-q8_0    # 8-bit — better quality, more VRAM
ollama pull gemma4:e4b-fp16    # Full precision — requires 8.6 GB VRAM

Ollama pros: Easiest installation, largest community, good API compatibility, rich model library. Ollama cons: No GUI, advanced tuning requires Modelfiles, no fine-tuning support.

Want to integrate Ollama into your product? Book a technical consultation — we have extensive experience with local LLM integration.


Method 2: LM Studio — Zero-Friction GUI Experience

LM Studio GUI interface — model selection and chat

Not everyone likes typing commands. If you are a PM, a designer, or simply someone who wants to try Gemma 4 without touching a terminal, LM Studio is the best choice.

LM Studio shipped day-one support for Gemma 4. Version 0.4.0 also introduced the lms CLI and llmster, giving power users a command-line option inside the same tool.

Step 1: Download and Install

Head to lmstudio.ai and download the installer for your OS. Available for macOS, Windows, and Linux.

The installation process is the same as any desktop app — next, next, finish.

Step 2: Search and Download a Gemma 4 Model

  1. Open LM Studio
  2. Click the "Discover" tab on the left
  3. Search for gemma-4
  4. You will see Unsloth's pre-quantized GGUF variants
  5. Pick one that fits your memory and click "Download"

Recommendations:

  • 8 GB RAM machine: gemma-4-E4B-it-GGUF (Q4_K_M)
  • 16 GB+ RAM machine: gemma-4-26B-A4B-it-GGUF (Q4_K_M)

Step 3: Load the Model and Adjust Settings

  1. Click the "Chat" tab
  2. Select your downloaded model from the dropdown at the top
  3. Adjust settings in the right panel:
    • Context Length: Default 4096; Gemma 4 small models support up to 128K
    • Temperature: Higher (0.7-1.0) for creative tasks, lower (0.1-0.3) for precision
    • GPU Offload: Slide to max if you have a discrete GPU

Step 4: Start Chatting

Type your question and go. LM Studio also supports:

  • Multimodal input: Drag an image into the chat — all Gemma 4 sizes support vision
  • System prompts: Define the model's role and behavior in the settings panel
  • Conversation history: Automatically saved; resume next time you open the app

Advanced: Using LM Studio as an API Server

LM Studio can also serve as a local API endpoint with OpenAI compatibility:

  1. Click the "Developer" tab
  2. Select a model and click "Start Server"
  3. Default endpoint: http://localhost:1234/v1

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e4b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

LM Studio pros: Zero learning curve, built-in model browser, multimodal drag-and-drop, API server mode. LM Studio cons: Higher resource overhead (Electron app), no fine-tuning, power users may find the GUI unnecessary.


Method 3: Unsloth — Maximum Efficiency + Fine-Tuning Ready

If your goal goes beyond "just run it" — fine-tuning, custom quantization, or squeezing maximum efficiency out of constrained hardware — Unsloth is the right tool.

Unsloth shipped day-one Gemma 4 support with pre-quantized GGUF and MLX models. Its MLX builds use roughly 40% less memory than Ollama on Apple Silicon, at the cost of 15-20% lower token throughput.

Step 1: Install Unsloth

# Create a virtual environment (recommended)
python3 -m venv unsloth-env
source unsloth-env/bin/activate

# Install Unsloth
pip install unsloth

If you are using an NVIDIA GPU, make sure CUDA toolkit 11.8+ is installed.

Step 2: Run Inference with Unsloth

from unsloth import FastLanguageModel

# Load model (auto-detects optimal quantization)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Prepare input
messages = [{"role": "user", "content": "Explain LoRA fine-tuning."}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# Generate
outputs = model.generate(input_ids=inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Step 3: Production-Grade Inference with vLLM

For multi-user serving, vLLM's batched inference is the better choice by a wide margin:

# Install vLLM
pip install vllm

# Start vLLM server
vllm serve unsloth/gemma-4-26B-A4B-it \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

vLLM's continuous batching and PagedAttention deliver 3-5x higher throughput than naive inference under real concurrent load.

Step 4: Seamless Transition to Fine-Tuning

This is Unsloth's killer advantage — same framework, no tool-switching:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Ready to train...

For the full fine-tuning walkthrough, see the Gemma 4 Fine-Tuning Guide.

Unsloth pros: Best memory efficiency, seamless inference-to-fine-tuning pipeline, optimized MLX for Mac, active community. Unsloth cons: Requires Python, more complex setup, not suited for non-technical users.

Need production-grade Gemma 4 deployment? Contact our architecture team — from inference servers to fine-tuning pipelines, we handle the full stack.


Which Method Should You Choose? Decision Matrix

| Criteria | Ollama | LM Studio | Unsloth |
|---|---|---|---|
| Ease of setup | Low (one command) | Easiest (GUI) | Medium-high (Python) |
| Install time | 2 min | 3 min | 10-15 min |
| Memory efficiency | Medium | Medium | High (~40% savings) |
| Inference speed | Fast | Fast | Medium (MLX 15-20% slower) |
| API compatibility | OpenAI-compatible | OpenAI-compatible | Pair with vLLM |
| GUI | None | Full desktop app | None |
| Fine-tuning | Not supported | Not supported | Native support |
| Multimodal | Supported | Supported (drag-and-drop) | Supported |
| Best for | Developers, CLI fans | Non-technical users | ML engineers, fine-tuning |

My recommendations:

  • "I just want to try it" — LM Studio. Download, install, search model, start chatting. Five minutes, zero commands.
  • "I need to integrate it into my app" — Ollama. Most stable API, largest community, Docker-friendly.
  • "I need fine-tuning or I'm memory-constrained" — Unsloth. That 40% memory saving is real, and the inference-to-training pipeline is seamless.

Troubleshooting Common Issues

OOM (Out of Memory) Errors

The most common problem. Symptoms: the model crashes mid-load, or the process gets killed during inference.

Solutions:

  1. Switch to a smaller quantization: Move from Q8 to Q4_K_M, or from Q4 to Q3_K_S
  2. Reduce context length: Drop from 128K to 8K or 4K
  3. Close memory-hungry apps: Chrome is usually the biggest offender
  4. Increase swap space: On Linux, adding swap slows things down but at least lets the model load

# Check current memory usage
nvidia-smi          # NVIDIA GPU
ollama ps           # See which models Ollama has loaded
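Point 1 above can be wrapped into a rough picker. The VRAM thresholds are assumptions derived from the approximate sizes quoted in this guide (Q4 for the 26B model is ~16 GB; Q8 is roughly double), not official numbers:

```python
# Rough quantization picker for the Gemma 4 26B MoE based on free VRAM.
# Thresholds are assumptions based on the approximate sizes in this guide.

def pick_quant(free_vram_gb: float) -> str:
    if free_vram_gb >= 30:
        return "q8_0"      # near-lossless, largest footprint
    if free_vram_gb >= 16:
        return "Q4_K_M"    # default quality/size trade-off
    if free_vram_gb >= 13:
        return "Q3_K_S"    # noticeable quality loss on some tasks
    return "too little VRAM: use E4B or E2B instead"

print(pick_quant(24))  # prints "Q4_K_M"
```

Leave a few gigabytes of headroom for the KV cache, especially at long context lengths.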

Slow Inference Speed

If the model runs but speed is poor (below 10 tok/s):

  1. Confirm the GPU is being used: Run nvidia-smi — if utilization is 0%, the model is running on CPU
  2. Increase GPU layers in Ollama: Create a Modelfile and set the num_gpu parameter
  3. Use more aggressive quantization: Q4_K_S is roughly 10-15% faster than Q4_K_M
  4. Mac users: use MLX builds: 30-50% faster than the llama.cpp backend
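For point 2, a minimal Modelfile sketch (the gemma4:26b tag matches the pull commands earlier in this guide; the num_gpu value is a starting point you may need to tune for your card):

```text
# Modelfile: raise the number of layers offloaded to the GPU
FROM gemma4:26b
PARAMETER num_gpu 99
```

Build and load it with ollama create gemma4-gpu -f Modelfile, then ollama run gemma4-gpu. A high num_gpu value asks Ollama to offload up to that many layers; if it exceeds your VRAM, lower it until loading succeeds.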

Download Failures

# Ollama: re-running pull resumes automatically
ollama pull gemma4:26b

# Manual GGUF download (for LM Studio or manual setups)
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
  --include "gemma-4-26B-A4B-it-Q4_K_M.gguf"

If Hugging Face downloads are slow, enable hf_transfer:

pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download ...

Garbled or Low-Quality Output

Usually a quantization issue. Q2 and Q3 variants show noticeable quality drops on certain tasks. Switch to Q4_K_M or higher, or add a system prompt to stabilize the output format.

Stuck on a deployment issue? Reach out to our support team — we provide end-to-end technical support for local AI deployment.


Frequently Asked Questions

Can I run Ollama and LM Studio at the same time?

Yes, but avoid loading models in both simultaneously. Each tool claims its own VRAM, making OOM errors likely. If you need both running, make sure one tool has unloaded its model first.

Gemma 4 26B MoE vs. 31B Dense — which should I run locally?

The 26B MoE is almost always the better choice. Only about 3.8B parameters are active per forward pass (a few experts are routed per token, though all expert weights stay loaded in memory), so it generates faster, and its total footprint is smaller than the 31B Dense. Benchmark scores are very close between the two. Unless you have a specific task where 31B clearly wins, go with 26B.
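The speed argument is easy to quantify: per-token compute scales roughly with active parameters, so comparing active counts gives the approximate gap. This is a compute-bound approximation — real throughput also depends on memory bandwidth:

```python
# Per-token compute scales roughly with the number of ACTIVE parameters,
# so compare active counts rather than totals.

dense_active_b = 31.0  # 31B Dense: every parameter participates per token
moe_active_b = 3.8     # 26B MoE: ~3.8B active parameters per forward pass

speedup = dense_active_b / moe_active_b
print(f"~{speedup:.1f}x less compute per token for the MoE")  # prints "~8.2x less compute per token for the MoE"
```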

Should I use Ollama or Unsloth MLX on a Mac?

Depends on your memory. If your Mac has 24 GB+ of unified memory, Ollama is easy and fast. If you only have 16 GB and want to run the 26B model, Unsloth's MLX build with its 40% memory savings may be your only option.

Is locally-run Gemma 4 the same quality as the API version?

The BF16 (full precision) version is identical to the API. Q8 quantization has nearly imperceptible quality loss. Q4 is good for most tasks, but you may see a 2-5% drop in math reasoning and long-form generation.


Next Steps

Local deployment is just the beginning. From here, natural next steps are the fine-tuning, hardware, and model-family guides linked throughout this article.

Ready to bring Gemma 4 into your enterprise workflow? Book a free architecture consultation — from model selection and hardware planning to production deployment, we provide end-to-end technical support.
