Why Run LLMs Locally?
Cloud APIs are convenient, but they come with trade-offs: data leaves your network, latency depends on internet connectivity, and costs scale linearly with usage. For enterprises in regulated industries — healthcare, finance, legal — local inference isn't optional, it's mandatory.
Beyond compliance, local LLMs enable rapid prototyping without API costs, offline development workflows, and complete control over model behavior and versioning.
Cost Reality: A team making 10,000 GPT-4 API calls/day spends ~$3,000/month. An RTX 4090 running Llama 3 8B locally costs $0/month after the initial hardware investment — and responds in under 500ms.
Getting Started with Ollama
Ollama makes local LLM deployment as simple as Docker makes container management. Install it, pull a model, and you're running inference in under 5 minutes.
Top Models to Try
- Llama 3 8B: Best all-around open model. Fits in 8GB VRAM with Q4_K_M quantization. Excellent at conversation, analysis, and code.
- Mistral 7B: Exceptional at code generation and structured outputs. Slightly faster than Llama 3 at similar quality.
- Phi-3 Mini (3.8B): Microsoft's compact powerhouse — surprisingly capable for its size. Perfect for resource-constrained environments.
- CodeLlama 13B: Purpose-built for software engineering tasks. Understands 15+ programming languages natively.
- Gemma 2 9B: Google's open model with strong multilingual capabilities and instruction-following.
Custom Modelfiles
Ollama's Modelfile format lets you create specialized model configurations with custom system prompts, temperature settings, and context windows. Think of it as a Dockerfile for LLMs — version-controllable and reproducible.
Example Modelfile: Define a specialized "SQL Expert" by setting a system prompt that constrains the model to only output valid PostgreSQL queries, with temperature 0.1 for deterministic results and a 4096-token context window.
Key Modelfile parameters to tune:
- temperature: 0.0-0.3 for deterministic tasks (code, SQL), 0.7-1.0 for creative tasks.
- num_ctx: Context window size. Default 2048, increase to 8192+ for long documents.
- top_p / top_k: Control sampling diversity. Lower values = more focused outputs.
- system: Your system prompt that defines the model's persona and constraints.
API Integration
Ollama exposes an OpenAI-compatible REST API on localhost:11434, meaning you can swap it into any existing OpenAI-based application with a single base URL change.
This makes local development and testing seamless before deploying to cloud inference in production. Your LangChain chains, LlamaIndex pipelines, and custom applications work identically — just change the endpoint.
Production Deployment Tips
- GPU Memory Management: Use
OLLAMA_NUM_PARALLELto control concurrent requests. Start with 2 and increase based on VRAM headroom. - Model Preloading: Keep frequently-used models loaded with
OLLAMA_KEEP_ALIVEto eliminate cold-start latency. - Load Balancing: Run multiple Ollama instances behind nginx for horizontal scaling across GPUs.
- Monitoring: Track GPU utilization, token throughput, and p99 latency to right-size your deployment.
When to Go Local vs. Cloud
Local inference is ideal for: development/testing, regulated data, high-volume repetitive tasks, and offline scenarios. Stick with cloud APIs for: state-of-the-art reasoning (GPT-4/Claude level), multi-modal tasks, and when you need the absolute highest quality regardless of cost.
The best architecture often combines both — local models handle 80% of routine queries cheaply, while complex requests are routed to cloud APIs. This hybrid approach typically reduces AI infrastructure costs by 60-70%.