Inference
A model is just a file until you run it somewhere. Inference is the infrastructure layer that turns weights into answers — and the deployment pattern you choose shapes everything from cost to compliance.
Four Deployment Patterns
Most production systems use more than one of these. The trick is knowing which pattern fits which workload — and when to mix them.
Cloud APIs
Call a model through an API endpoint — OpenAI, Anthropic, Google, Mistral, and others. No infrastructure to manage, no GPUs to provision. You send a request, you get a response. This is where most projects start, and for many workloads it's where they should stay.
- Pay per token, scale instantly
- Always running the latest model versions
- Data leaves your network on every call
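In code, the cloud-API pattern is little more than an HTTP POST. Here is a minimal, stdlib-only sketch of building a request for an OpenAI-style chat-completions endpoint — the URL, key, and model name are placeholders, not any specific provider's values:

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Build an HTTP request for an OpenAI-style chat-completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it is one more line — and note that every call crosses the
# network boundary, which is exactly the data-sovereignty tradeoff above:
# with urllib.request.urlopen(build_chat_request(...)) as resp:
#     answer = json.load(resp)["choices"][0]["message"]["content"]
```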
Self-Hosted
Run open-weight models on your own cloud infrastructure using serving engines like vLLM, Ollama, or TGI. You pick the model, the hardware, and the configuration. More work upfront, but you control the economics and can optimise for your specific workload.
- Fixed compute cost, predictable at scale
- Tune batch sizes, quantisation, and caching
- You own the ops — updates, scaling, monitoring
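"Predictable at scale" starts with capacity planning. A rough sketch of the sizing arithmetic — the throughput figure is whatever your serving engine sustains on your hardware at your batch size, and the example numbers are purely illustrative:

```python
import math

def gpus_needed(requests_per_s: float, avg_tokens_per_request: float,
                tokens_per_s_per_gpu: float) -> int:
    """Rough GPU count for a steady-state workload.

    tokens_per_s_per_gpu is what your serving engine (vLLM, TGI, ...)
    actually sustains on your hardware — measure it, don't guess it.
    """
    demand = requests_per_s * avg_tokens_per_request  # tokens/s to serve
    return max(1, math.ceil(demand / tokens_per_s_per_gpu))

# e.g. 20 req/s averaging 500 output tokens, with an engine sustaining
# 2,500 tok/s per GPU: gpus_needed(20, 500, 2500) -> 4
```

The point of the sketch: once you know the fleet size, the cost is fixed regardless of traffic — which is exactly what makes self-hosting predictable, and also what makes it wasteful for bursty workloads.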
On-Prem
Models running on hardware you physically control — in your data centre, behind your firewall. This is the answer when regulation or policy says data cannot leave the building. It's the most expensive pattern to operate, but sometimes it's the only compliant option.
- Full data sovereignty — nothing leaves the network
- Required for air-gapped and classified environments
- Highest capital and operational cost
Edge
Small, quantised models running directly on phones, laptops, IoT devices, or embedded hardware. No network round-trip, no API cost per call. The tradeoff is capability — edge models are smaller and less powerful, but for the right use case they're unbeatable on latency and privacy.
- Zero network dependency — works offline
- Lowest latency — no network round-trip in the loop
- Limited to smaller, quantised models
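Whether a model fits on a device is mostly a memory question. A back-of-the-envelope estimator — the 20% overhead factor for activations and KV cache is a rough assumption, not a measured figure:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Approximate RAM to hold the weights, with ~20% headroom
    for activations and KV cache (the overhead is an assumption)."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model quantised to 4 bits fits on a modern laptop:
# model_memory_gb(7, 4)  -> 4.2 (GB)
# The same model at 16-bit needs ~16.8 GB — out of reach for most edge devices,
# which is why "edge" and "quantised" almost always go together.
```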
What's the Difference?
Each pattern makes different tradeoffs. Here's how they compare on the dimensions that matter most.
| | Cloud APIs | Self-Hosted | On-Prem | Edge |
|---|---|---|---|---|
| Cost model | Pay per token | Fixed compute | Capital + ops | Device hardware |
| Latency | Network-bound | Tuneable | Low (local network) | Lowest |
| Data sovereignty | Data leaves network | Your cloud tenant | Full control | On-device only |
| Setup complexity | Minutes | Hours to days | Weeks to months | Varies widely |
| Model selection | Provider's catalogue | Any open-weight model | Any open-weight model | Small/quantised only |
| Scaling | Automatic | Manual or auto-scale | Hardware-bound | Per-device |
In practice, I often recommend a hybrid approach: cloud APIs for development and bursty workloads, self-hosted for steady-state production traffic, and on-prem or edge only where compliance or latency demands it.
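The cloud-vs-self-hosted part of that recommendation comes down to a break-even calculation. A sketch with illustrative numbers — real token prices and GPU costs vary widely, and this deliberately ignores ops time, which usually tilts the margin back towards the API:

```python
def breakeven_tokens_per_month(price_per_1m_tokens: float,
                               fixed_monthly_cost: float) -> float:
    """Monthly token volume above which self-hosting beats pay-per-token.
    Ignores ops overhead — staff time to run the cluster is a real cost."""
    return fixed_monthly_cost / price_per_1m_tokens * 1_000_000

# Illustrative only: $10 per 1M tokens vs a $4,000/month GPU node.
# breakeven_tokens_per_month(10, 4000) -> 400,000,000 tokens/month
# Below that volume, the API is cheaper; above it, fixed compute wins.
```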
Pairs With
Inference doesn't exist in isolation. The deployment pattern you choose ripples through the rest of the stack.
Models
The same model can run in very different ways depending on the infrastructure underneath it. A 70B-parameter model served via cloud API behaves differently from the same model self-hosted with vLLM and custom batching. The model is the what — inference is the how.
Safety
Your inference pattern determines your trust boundaries. Cloud APIs mean data leaves your network on every call. Self-hosted means you control the perimeter but still need to secure the endpoint. On-prem gives you full control — but full responsibility too. Every safety architecture starts with "where does the data go?"
Need help choosing a deployment pattern?
I help teams figure out which inference strategy fits their workload, budget, and compliance requirements — and build the infrastructure to make it real.