Running LLMs Locally with Ollama

Back to Blog

Why Run LLMs Locally?

Cloud APIs are convenient, but they come with trade-offs: data leaves your network, latency depends on internet connectivity, and costs scale linearly with usage. For enterprises in regulated industries — healthcare, finance, legal — local inference isn't optional, it's mandatory.

Beyond compliance, local LLMs enable rapid prototyping without API costs, offline development workflows, and complete control over model behavior and versioning.

Cost Reality: A team making 10,000 GPT-4 API calls/day spends ~$3,000/month. An RTX 4090 running Llama 3 8B locally costs $0/month after the initial hardware investment — and responds in under 500ms.

Getting Started with Ollama

Ollama makes local LLM deployment as simple as Docker makes container management. Install it, pull a model, and you're running inference in under 5 minutes.

Top Models to Try

Llama 3 8B: Best all-around open model. Fits in 8GB VRAM with Q4_K_M quantization. Excellent at conversation, analysis, and code.
Mistral 7B: Exceptional at code generation and structured outputs. Slightly faster than Llama 3 at similar quality.
Phi-3 Mini (3.8B): Microsoft's compact powerhouse — surprisingly capable for its size. Perfect for resource-constrained environments.
CodeLlama 13B: Purpose-built for software engineering tasks. Understands 15+ programming languages natively.
Gemma 2 9B: Google's open model with strong multilingual capabilities and instruction-following.

Custom Modelfiles

Ollama's Modelfile format lets you create specialized model configurations with custom system prompts, temperature settings, and context windows. Think of it as a Dockerfile for LLMs — version-controllable and reproducible.

Example Modelfile: Define a specialized "SQL Expert" by setting a system prompt that constrains the model to only output valid PostgreSQL queries, with temperature 0.1 for deterministic results and a 4096-token context window.

Key Modelfile parameters to tune:

temperature: 0.0-0.3 for deterministic tasks (code, SQL), 0.7-1.0 for creative tasks.
num_ctx: Context window size. Default 2048, increase to 8192+ for long documents.
top_p / top_k: Control sampling diversity. Lower values = more focused outputs.
system: Your system prompt that defines the model's persona and constraints.

API Integration

Ollama exposes an OpenAI-compatible REST API on localhost:11434, meaning you can swap it into any existing OpenAI-based application with a single base URL change.

This makes local development and testing seamless before deploying to cloud inference in production. Your LangChain chains, LlamaIndex pipelines, and custom applications work identically — just change the endpoint.

Production Deployment Tips

GPU Memory Management: Use OLLAMA_NUM_PARALLEL to control concurrent requests. Start with 2 and increase based on VRAM headroom.
Model Preloading: Keep frequently-used models loaded with OLLAMA_KEEP_ALIVE to eliminate cold-start latency.
Load Balancing: Run multiple Ollama instances behind nginx for horizontal scaling across GPUs.
Monitoring: Track GPU utilization, token throughput, and p99 latency to right-size your deployment.

When to Go Local vs. Cloud

Local inference is ideal for: development/testing, regulated data, high-volume repetitive tasks, and offline scenarios. Stick with cloud APIs for: state-of-the-art reasoning (GPT-4/Claude level), multi-modal tasks, and when you need the absolute highest quality regardless of cost.

The best architecture often combines both — local models handle 80% of routine queries cheaply, while complex requests are routed to cloud APIs. This hybrid approach typically reduces AI infrastructure costs by 60-70%.