How to Run AI Models Locally on Your Laptop: A Practical Guide to Ollama

Running AI models locally on a laptop with Ollama, no cloud required. Image: Elekes Andor via Wikimedia Commons (CC BY-SA 4.0)

Why Bother Running AI on Your Own Machine?

I used to think local AI was just for researchers with GPU clusters. You know the type — the ones who can justify spending $40,000 on eight A100s because they’re “exploring the attention mechanism.” That’s not most of us.

Then I actually tried it on my laptop. A three-year-old ThinkPad with 16GB of RAM. And it worked. Not “technically it didn’t crash” worked. Actually useful worked.

Here’s the thing that changed my mind: every time you type something into ChatGPT or Claude, you’re sending your thoughts to someone else’s computer. For casual questions, fine. But for client code, internal documents, or anything remotely sensitive — that’s a different conversation. Running models locally means your data never leaves your machine. No API keys to leak. No usage caps to hit at 3 AM when you’re debugging. No “we’re experiencing high demand” messages.

Also, and I say this as someone who builds software for a living: the privacy argument isn’t hypothetical anymore. Earlier this month, lawmakers proposed banning AI companies from selling health and location data people reveal to chatbots. The fact that we even need that law tells you something.

What You Actually Need (Spoiler: Less Than You Think)

Let me save you the spec-sheet anxiety. You don’t need a Threadripper. You don’t need 128GB of RAM. The minimum that actually works:

8GB RAM — for 3B parameter models (Phi-3, Gemma 2B)
16GB RAM — for 7-8B models (Llama 3, Mistral, Gemma 7B)
32GB RAM — for 13-14B models (Qwen 2.5, Phi-4)
Any modern CPU — Apple Silicon M1+ is great, but Intel/AMD work fine too
GPU optional — it helps but isn’t required for smaller models

The 7-8B range is the sweet spot. Models in this class are genuinely useful — they can write code, summarize documents, answer questions with real depth, and follow multi-step instructions. And they run on a machine you probably already own.

I tested Llama 3.1 8B on my 16GB laptop and got about 12-15 tokens per second. That’s reading speed. Not blistering fast, but completely usable for chat. If you have a MacBook with Apple Silicon, you’ll see 25-40 tokens per second on the same model. The chip matters, but not nearly as much as the model size.

Step 1: Install Ollama (The Easy Way)

There are several ways to run local models. LM Studio has a nice GUI. llama.cpp gives you maximum control. GPT4All is dead simple. But Ollama strikes the best balance between power and simplicity, so that’s what we’ll use.

Grab the installer from ollama.com. It’s available for macOS, Linux, and Windows. One-click install on Mac and Windows. On Linux, it’s a single curl command:

curl -fsSL https://ollama.com/install.sh | sh

Once installed, Ollama runs as a background service. You can confirm the command-line tool is installed with:

ollama --version
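
The version check only tells you the CLI is on your PATH. To see whether the background server is actually answering, you can probe its local HTTP port. This is a minimal sketch, assuming the default address of localhost:11434; OLLAMA_URL and ollama_is_up are names made up for this example, not part of Ollama.

```shell
# Assumption: a stock install listening on the default port.
# OLLAMA_URL and ollama_is_up are illustrative names, not Ollama features.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

ollama_is_up() {
  # -f: treat HTTP errors as failure, -s: no progress output,
  # --max-time: give up after 2 seconds instead of hanging
  curl -fs --max-time 2 "$1" >/dev/null 2>&1
}

if ollama_is_up "$OLLAMA_URL"; then
  echo "Ollama server is reachable at $OLLAMA_URL"
else
  echo "Ollama server is not reachable at $OLLAMA_URL"
fi
```

If it reports the server as unreachable, starting the desktop app (or running ollama serve in another terminal) is usually the first thing to try.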

That’s it. No Docker. No CUDA drivers to wrestle with. No Python virtual environments to activate. Just install and go. I remember spending an entire Saturday afternoon trying to get a local LLM running two years ago — compiling llama.cpp from source, hunting down the right model format, converting weights. It was a mess. This is better.

Step 2: Pull and Run Your First Model

Here’s the command that actually does something:

ollama run llama3.2

On first run, it downloads the model (about 2GB for the default 3B version of llama3.2; an 8B model such as llama3.1 is closer to 5GB). After that, you’re dropped into a chat interface right in your terminal. Type your question, hit Enter, and watch it generate.

Some models worth trying depending on what you need:

llama3.2 (3B) — Fast, lightweight, good enough for simple tasks
llama3.1 (8B) — Solid all-rounder, my daily driver
mistral (7B) — Strong at reasoning and structured output
gemma2 (9B) — Google’s model, excellent instruction following
phi3 (3.8B) — Microsoft’s tiny model, surprisingly good at code
qwen2.5-coder (7B) — If you’re doing programming specifically
deepseek-r1 (8B) — Reasoning model that shows its chain of thought

Each model has different strengths. Llama 3.1 is the safe default — Meta trained it on a massive dataset, and it handles most tasks without weird behavior. Mistral is better for structured tasks like extracting information or following specific formats. Gemma 2 feels more conversational. Don’t overthink it — download a couple and see which one fits your brain.
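
One low-effort way to compare a couple of models is to send each the same prompt and read the answers side by side. The sketch below assumes the models are already pulled (ollama run will download them otherwise); compare_models is a hypothetical helper name, and the example prompt is just a placeholder.

```shell
# compare_models is an illustrative helper, not an Ollama command.
compare_models() {
  # $1 = prompt, remaining arguments = model names
  prompt=$1
  shift
  for model in "$@"; do
    printf '=== %s ===\n' "$model"
    # One-shot mode: passing the prompt as an argument prints the answer and exits
    ollama run "$model" "$prompt"
  done
}

# Example (needs Ollama installed):
#   compare_models "Explain what a mutex is in two sentences." llama3.2 mistral
```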

One thing I learned the hard way: the 8B models eat about 5-6GB of RAM. If you’re running Docker, an IDE, a browser with 40 tabs, and Slack — you’ll feel it. Close what you can before loading a model. Or use the 3B version for quick tasks when memory is tight.
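
The 5-6GB figure is roughly what back-of-the-envelope arithmetic predicts. The sketch below assumes about 4.5 bits per weight (a guess at typical 4-bit quantization plus metadata) and a flat 1GB for context cache and runtime overhead. Both numbers are assumptions for illustration, not measurements, and estimate_ram_gb is a made-up helper.

```shell
# estimate_ram_gb is an illustrative helper. The 4.5 bits/weight default and the
# 1 GB overhead are assumptions, not values reported by Ollama.
estimate_ram_gb() {
  # $1 = parameters in billions, $2 = bits per weight (default 4.5)
  LC_ALL=C awk -v p="$1" -v b="${2:-4.5}" \
    'BEGIN { printf "%.1f\n", p * b / 8 + 1.0 }'
}

estimate_ram_gb 8    # prints 5.5
estimate_ram_gb 3    # prints 2.7
```

Treat the result as a floor: longer prompts and larger context settings push real usage higher.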

Step 3: Use It Like a Real Tool (Not a Toy)

The terminal chat interface is nice for testing, but you’ll want to actually use these models in your workflow. Here are three ways that don’t feel like a gimmick:

Code Review Buddy

Ollama exposes a REST API at localhost:11434. This means any tool that speaks HTTP can talk to your model. VS Code extensions like Continue.dev and Cody can connect to your local Ollama instance. So instead of sending your code to OpenAI’s servers for review, the analysis happens on your machine.

I’ve been using Continue with Llama 3.1 for quick code explanations — highlight a function, ask “what does this do,” and get an answer in seconds. It’s not as polished as Claude for complex refactoring, but for 80% of the questions I have while coding, it’s more than enough.
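
If you would rather script against the API directly than go through an editor extension, a single POST to the /api/generate endpoint is enough. This is a minimal sketch, assuming the default port and that jq is installed (an extra dependency this guide does not otherwise need); OLLAMA_URL, build_generate_payload, ask_ollama, and utils.py are all names invented for the example.

```shell
# Assumptions: jq is installed, and Ollama listens on its default port.
# OLLAMA_URL, build_generate_payload, and ask_ollama are illustrative names.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

build_generate_payload() {
  # $1 = model name, $2 = prompt text.
  # jq takes care of JSON-escaping quotes and newlines in the prompt.
  jq -cn --arg model "$1" --arg prompt "$2" \
    '{model: $model, prompt: $prompt, stream: false}'
}

ask_ollama() {
  # POST the payload and print only the generated text from the reply.
  build_generate_payload "$1" "$2" |
    curl -fsS -H 'Content-Type: application/json' \
      --data-binary @- "$OLLAMA_URL/api/generate" |
    jq -r '.response'
}

# Example (requires a running server and a pulled model; utils.py is hypothetical):
#   ask_ollama llama3.1 "What does this code do? $(cat utils.py)"
```

Setting stream to false returns one JSON object instead of a stream of partial chunks, which keeps the parsing to a single jq call.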

Document Summarizer

Got a 40-page PDF and 10 minutes before a meeting? Convert it to plain text first (a tool such as pdftotext can do this), then pipe it through:

cat document.txt | ollama run llama3.2 "Summarize this in 3 bullet points"

For longer documents, you’ll want to break them into chunks: most 7B models have context windows of 8K-32K tokens, and Ollama’s default context setting can be smaller than the model’s maximum unless you raise it. But for emails, meeting notes, and spec documents, this works out of the box.
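
Chunking can be as crude as splitting the file by line count and summarizing each piece in turn. The sketch below takes that approach. Line count is only a rough stand-in for tokens, the 200-line default is an arbitrary starting point, and summarize_in_chunks is a hypothetical helper name.

```shell
# summarize_in_chunks is an illustrative helper. Splitting by lines is a rough
# proxy for token count; tune the chunk size for your documents.
summarize_in_chunks() {
  # $1 = text file, $2 = lines per chunk (default 200), $3 = model (default llama3.2)
  file=$1
  lines_per_chunk=${2:-200}
  model=${3:-llama3.2}
  tmpdir=$(mktemp -d) || return 1
  split -l "$lines_per_chunk" "$file" "$tmpdir/chunk_"
  for chunk in "$tmpdir"/chunk_*; do
    [ -e "$chunk" ] || continue   # nothing to do if the input file was empty
    printf '## %s\n' "${chunk##*/}"
    ollama run "$model" "Summarize this in 3 bullet points" < "$chunk"
  done
  rm -rf "$tmpdir"
}

# Example (needs Ollama installed):
#   summarize_in_chunks document.txt 150 llama3.1
```

For a final overview, you could feed the per-chunk summaries back through the model once more.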

Privacy-Safe Drafting

I write a lot of internal memos at work — the kind that mention specific projects, budgets, and personnel. Those should never touch a cloud service. With a local model, I can say “rewrite this memo to be more direct” without worrying about where that text ends up. It’s a small thing until you need it, and then it’s everything.

As I mentioned in my guide on reviewing AI-generated code, the key isn’t trusting AI blindly — it’s knowing exactly what it’s doing with your data. Local models make that conversation a lot simpler.

What Local Models Still Can’t Do Well

I want to be honest here because the hype around local AI can get out of hand. These models are impressive, but they have limits:

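Long context is hard. On a laptop, most 7-8B models are practical up to about 32K tokens (roughly 50 pages). Some, like Llama 3.1, advertise 128K, but holding that much context in memory takes more RAM than a 16GB machine can spare. GPT-4 and Claude handle 128K-200K. If you’re working with entire codebases or book-length documents, cloud models still win.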
Complex reasoning lags. Multi-step math, legal analysis, and nuanced policy questions are where 8B models show their size. They’ll try, but the answer won’t have the depth of a frontier model.
Multilingual performance varies. English is excellent. Filipino, Taglish, Cebuano — these smaller models weren’t trained extensively on Philippine languages. Llama 3.1 handles basic Tagalog okay; Mistral struggles.
No browsing, no tools, no memory between sessions. Your local model is a frozen snapshot of its training data. It doesn’t know today’s news and can’t look things up unless you build that infrastructure around it.

The biggest practical limitation for me is context. When I’m working with AI agents that need to chain multiple tool calls, the 8K-32K context window fills up fast. You learn to be economical with your prompts — shorter, more specific, fewer examples.
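
A quick sanity check before sending a big file is to estimate its token count and compare that with the window you have. The sketch below uses the common rule of thumb of about 4 characters per token for English text, which is an approximation and not the model’s real tokenizer. The 8192-token default and the 25% held back for the reply are also assumptions, and both helper names are invented.

```shell
# approx_tokens and fits_in_context are illustrative helpers.
# The 4-characters-per-token ratio is a heuristic, not a real tokenizer.
approx_tokens() {
  # $1 = file; prints an estimated token count
  chars=$(wc -c < "$1")
  echo $(( chars / 4 ))
}

fits_in_context() {
  # $1 = file, $2 = context window in tokens (default 8192).
  # Succeeds if the file uses at most 75% of the window, leaving room for the reply.
  limit=${2:-8192}
  [ "$(approx_tokens "$1")" -le $(( limit * 3 / 4 )) ]
}

# Example:
#   fits_in_context notes.txt 8192 && echo "should fit" || echo "chunk it first"
```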

When Local Beats Cloud (And When It Doesn’t)

After a few months of using both, here’s my personal rule of thumb:

Use local when: You’re working with sensitive data, you need offline access, you’re doing repetitive tasks where latency matters more than intelligence, or you’re just exploring and don’t want to burn API credits.

Use cloud when: You need the smartest possible answer, you’re working with very long documents, you need real-time information, or the task genuinely requires frontier-model reasoning.

The sweet spot I’ve found: use local models for 70% of your AI interactions (code explanations, quick drafts, data extraction, first-pass reviews), and reserve cloud models for the 30% that actually need the extra intelligence. That ratio saves money, protects privacy, and honestly — makes you more intentional about when you reach for the big guns.

If you’re doing this on a laptop, you’ll want to keep it from sleeping during long model runs — I covered that exact setup for Mac, Linux, and Windows after losing a few generations to my MacBook’s aggressive power saving.

The Bottom Line

Running AI models locally is no longer a hobbyist project — it’s a legitimate tool for daily work. The barrier dropped from “need a GPU cluster” to “download an app and wait 10 minutes.” If you haven’t tried it since the llama.cpp days of 2023, the experience is unrecognizable now.

Start with Ollama. Pull llama3.2. Ask it something. See how it feels.

You might be surprised at how capable a file of just a few gigabytes on your laptop can be. I was.

And if you’re worried about privacy, you should be. Not paranoid, but practical. The same way you wouldn’t email your bank password to a stranger, you probably shouldn’t paste your company’s internal architecture document into a public chatbot. Local models give you an option that was out of reach for most people a couple of years ago: keep the intelligence, lose the exposure.

That’s a trade worth making.