Inference
A model is just a file until you run it somewhere. Inference is the infrastructure layer that turns weights into answers — and the deployment pattern you choose shapes everything from cost to compliance.
Four Deployment Patterns
Most production systems use more than one of these. The trick is knowing which pattern fits which workload — and when to mix them.
Cloud APIs
Call a model through an API endpoint — OpenAI, Anthropic, Google, Mistral, and others. No infrastructure to manage, no GPUs to provision. You send a request, you get a response. This is where most projects start, and for many workloads it's where they should stay.
- Pay per token, scale instantly
- Always running the latest model versions
- Data leaves your network on every call
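In code, the cloud-API pattern is little more than an HTTP POST. Here is a minimal, stdlib-only sketch of building a request for an OpenAI-style chat-completions endpoint — the URL, key, and model name are placeholders, not any specific provider's values:

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Build an HTTP request for an OpenAI-style chat-completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it is one more line — and note that every call crosses the
# network boundary, which is exactly the data-sovereignty tradeoff above:
# with urllib.request.urlopen(build_chat_request(...)) as resp:
#     answer = json.load(resp)["choices"][0]["message"]["content"]
```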
Self-Hosted
Run open-weight models on your own cloud infrastructure using serving engines like vLLM, Ollama, or TGI. You pick the model, the hardware, and the configuration. More work upfront, but you control the economics and can optimise for your specific workload.
- Fixed compute cost, predictable at scale
- Tune batch sizes, quantisation, and caching
- You own the ops — updates, scaling, monitoring
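"Predictable at scale" starts with capacity planning. A rough sketch of the sizing arithmetic — the throughput figure is whatever your serving engine sustains on your hardware at your batch size, and the example numbers are purely illustrative:

```python
import math

def gpus_needed(requests_per_s: float, avg_tokens_per_request: float,
                tokens_per_s_per_gpu: float) -> int:
    """Rough GPU count for a steady-state workload.

    tokens_per_s_per_gpu is what your serving engine (vLLM, TGI, ...)
    actually sustains on your hardware — measure it, don't guess it.
    """
    demand = requests_per_s * avg_tokens_per_request  # tokens/s to serve
    return max(1, math.ceil(demand / tokens_per_s_per_gpu))

# e.g. 20 req/s averaging 500 output tokens, with an engine sustaining
# 2,500 tok/s per GPU: gpus_needed(20, 500, 2500) -> 4
```

The point of the sketch: once you know the fleet size, the cost is fixed regardless of traffic — which is exactly what makes self-hosting predictable, and also what makes it wasteful for bursty workloads.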
On-Prem
Models running on hardware you physically control — in your data centre, behind your firewall. This is the answer when regulation or policy says data cannot leave the building. It's the most expensive pattern to operate, but sometimes it's the only compliant option.
- Full data sovereignty — nothing leaves the network
- Required for air-gapped and classified environments
- Highest capital and operational cost
Edge
Small, quantised models running directly on phones, laptops, IoT devices, or embedded hardware. No network round-trip, no API cost per call. The tradeoff is capability — edge models are smaller and less powerful, but for the right use case they're unbeatable on latency and privacy.
- Zero network dependency — works offline
- Lowest latency — no network round-trip in the loop
- Limited to smaller, quantised models
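Whether a model fits on a device is mostly a memory question. A back-of-the-envelope estimator — the 20% overhead factor for activations and KV cache is a rough assumption, not a measured figure:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Approximate RAM to hold the weights, with ~20% headroom
    for activations and KV cache (the overhead is an assumption)."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model quantised to 4 bits fits on a modern laptop:
# model_memory_gb(7, 4)  -> 4.2 (GB)
# The same model at 16-bit needs ~16.8 GB — out of reach for most edge devices,
# which is why "edge" and "quantised" almost always go together.
```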
What's the Difference?
Each pattern makes different tradeoffs. Here's how they compare on the dimensions that matter most.
| | Cloud APIs | Self-Hosted | On-Prem | Edge |
|---|---|---|---|---|
| Cost model | Pay per token | Fixed compute | Capital + ops | Device hardware |
| Latency | Network-bound | Tuneable | Low (local network) | Lowest |
| Data sovereignty | Data leaves network | Your cloud tenant | Full control | On-device only |
| Setup complexity | Minutes | Hours to days | Weeks to months | Varies widely |
| Model selection | Provider's catalogue | Any open-weight model | Any open-weight model | Small/quantised only |
| Scaling | Automatic | Manual or auto-scale | Hardware-bound | Per-device |
In practice, I often recommend a hybrid approach: cloud APIs for development and bursty workloads, self-hosted for steady-state production traffic, and on-prem or edge only where compliance or latency demands it.
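The cloud-vs-self-hosted part of that recommendation comes down to a break-even calculation. A sketch with illustrative numbers — real token prices and GPU costs vary widely, and this deliberately ignores ops time, which usually tilts the margin back towards the API:

```python
def breakeven_tokens_per_month(price_per_1m_tokens: float,
                               fixed_monthly_cost: float) -> float:
    """Monthly token volume above which self-hosting beats pay-per-token.
    Ignores ops overhead — staff time to run the cluster is a real cost."""
    return fixed_monthly_cost / price_per_1m_tokens * 1_000_000

# Illustrative only: $10 per 1M tokens vs a $4,000/month GPU node.
# breakeven_tokens_per_month(10, 4000) -> 400,000,000 tokens/month
# Below that volume, the API is cheaper; above it, fixed compute wins.
```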
Pairs With
Inference doesn't exist in isolation. The deployment pattern you choose ripples through the rest of the stack.
Models
The same model can run in very different ways depending on the infrastructure underneath it. A 70B-parameter model served via cloud API behaves differently from the same model self-hosted with vLLM and custom batching. The model is the what — inference is the how.
Safety
Your inference pattern determines your trust boundaries. Cloud APIs mean data leaves your network on every call. Self-hosted means you control the perimeter but still need to secure the endpoint. On-prem gives you full control — but full responsibility too. Every safety architecture starts with "where does the data go?"
Need help choosing a deployment pattern?
I help teams figure out which inference strategy fits their workload, budget, and compliance requirements — and build the infrastructure to make it real.