Most teams should start with a hosted LLM API. The economics, the operational simplicity, and the pace of model improvement all favor calling someone else's endpoint over running your own. That is the honest default, and any vendor who tells you otherwise is selling infrastructure you may not need.

But "most teams" is not "all teams," and "to start" is not "forever." There is a real and growing set of situations where running models on your own infrastructure is the better engineering decision, sometimes the only legally viable one. The hard part is telling the two apart without either over-building for a compliance fear that does not actually apply, or under-building and discovering at audit time that your data went somewhere it should not have.

This article is about how to make that call. It covers when private or air-gapped hosting is genuinely worth it, the open models and serving stacks involved, how to think about GPU sizing and quantization, the security model, the fine-tuning versus retrieval question, and a decision framework you can apply to your own workload. We do a lot of this work at XOVO, so the framing here reflects what tends to hold up in production rather than what looks good in a pitch deck.

## What "Private LLM Hosting" Actually Means

The phrase covers a spectrum, not a single architecture. It helps to be precise, because the trade-offs differ sharply across the spectrum.

At one end is a **dedicated or single-tenant deployment of a hosted model**, where a provider runs the model but isolates your traffic and data on infrastructure reserved for you. Your data still leaves your network, but it is contractually and technically partitioned. In the middle is **self-hosting open-weight models in your own cloud account** (VPC), for example running Llama or Mistral on GPU instances you control, inside your own network boundary and security groups.
At the far end is **fully air-gapped, on-premises hosting**, where the model runs on hardware you own, in a network with no path to the public internet at all.

Each step toward the air-gapped end buys you more control and more isolation, and costs you more in operational burden, capital, and the speed at which you can adopt newer models. The decision is rarely "API versus on-prem." More often it is "which point on this spectrum matches my actual constraints." Our [Private LLM Hosting & Deployment](/services/private-llm-hosting) work usually starts by locating a workload precisely on that spectrum before any hardware or model is chosen.

## The Real Reasons to Self-Host

There are exactly five reasons that justify the operational cost of running your own models. If none of them apply with force, you probably should not self-host. If one or more applies strongly, the API default no longer holds.

### Data residency and regulatory compliance

This is the most common legitimate driver. Under GDPR, personal data of EU residents carries obligations about where it is processed and who can access it; cross-border transfer mechanisms exist but add legal complexity and risk. Under HIPAA in the US, protected health information can be sent to a third-party API only under a Business Associate Agreement, and even then some organizations conclude the cleanest posture is to never let PHI leave their controlled environment. Defense, government, and certain financial workloads carry similar or stricter constraints.

Major API providers do offer enterprise tiers with BAAs, regional data processing, zero-retention modes, and SOC 2 attestations, and for many regulated teams those are sufficient. Self-hosting becomes the answer when your obligations or your risk appetite require that the data physically never traverses a network boundary you do not own, or when you must demonstrate, not merely contractually assert, that no third party can access it.
Air-gapped deployment is the strongest form of that demonstration.

### Intellectual property protection

If your prompts and outputs encode proprietary methods, source code, unreleased product designs, or trade secrets, every API call is a transmission of that material to an external party. Reputable providers do not train on enterprise traffic and contractually commit to that, which is enough for most companies. But some IP is valuable enough, or some adversarial threat models serious enough, that the only acceptable answer is keeping it inside the building. This is common in semiconductor design, pharmaceutical research, and proprietary trading.

### Cost at sustained high volume

API pricing is excellent at low and moderate volume and stays competitive far longer than most people expect, because you pay only for tokens consumed and nothing for idle capacity. Self-hosting flips the cost structure: you pay for the GPU whether it is busy or not. That only wins when utilization is consistently high.

The crossover is workload-specific, but the shape is reliable. A single high-end inference GPU rented in the cloud runs on the order of one to two US dollars per hour, which is roughly seven hundred to fifteen hundred dollars per month if kept running continuously. If that GPU, well-batched, serves enough sustained traffic to displace a comparable monthly API spend, self-hosting starts to pay off. If your traffic is spiky, low-average, or unpredictable, the idle GPU bleeds money and the API wins decisively. The mistake we see most often is teams projecting a cost saving from peak throughput while their actual average utilization is fifteen percent.

### Latency and offline operation

A model running on a GPU inside your own network, close to your application, removes the public-internet round trip and the variable queueing of a shared API.
For interactive applications with tight tail-latency budgets, or for edge and on-premises environments that must function with no connectivity at all (a factory floor, a ship, a secure facility), self-hosting is not an optimization, it is a requirement.

### Determinism and version control

Hosted models change. Providers update, deprecate, and retire model versions on their own schedule, and behavior can shift under you. Self-hosting an open-weight model means the weights are frozen until you choose to change them, which matters for regulated validation, reproducible evaluation, and any system where a silent behavior change is a defect.

## The Open Models and Serving Stacks

Self-hosting is viable today mainly because open-weight models have become genuinely capable. Meta's Llama family and Mistral's models (including their mixture-of-experts variants) are the common starting points, alongside other strong open releases. They span a wide size range, from small models that run on a single modest GPU to large ones that need multi-GPU servers. For many enterprise tasks, especially when paired with good retrieval, a mid-sized open model is sufficient, and you do not need the largest model to get acceptable quality.

The serving layer is where a lot of practical performance lives. The model weights alone do not make an efficient endpoint; the inference server does.
The main options, conceptually:

| Serving stack | Best fit | Strengths | Trade-offs |
|---|---|---|---|
| **vLLM** | High-throughput production serving | Continuous batching, PagedAttention for efficient memory use, strong concurrency | More setup than a one-line tool; GPU-focused |
| **Ollama** | Local development, prototyping, single-user | Trivial to run, manages models simply, friendly CLI | Not built for high-concurrency production load |
| **TGI (Text Generation Inference)** | Production serving in the Hugging Face ecosystem | Mature, good tooling and integrations, batching | Tied to its ecosystem conventions |
| **Triton Inference Server** | Mixed multi-model fleets, enterprise MLOps | Serves many model types, mature ops features, hardware backends | Heavier to operate; broader than just LLMs |

A useful mental model: **Ollama** is what you run on a laptop to try a model; **vLLM** or **TGI** is what you put behind a real application that serves concurrent users; **Triton** is what a platform team standardizes on when serving many models of many types at scale.

The difference between a naive server and a properly batched one like vLLM is not marginal. Continuous batching can multiply the number of concurrent requests a single GPU sustains, which directly changes the cost math from the previous section.

## Hardware Sizing and the Quantization Trade-off

The first sizing question is memory, not raw compute. A model has to fit in GPU VRAM along with its key-value cache, which grows with the number of concurrent requests and the length of their contexts.

A rough rule for full 16-bit precision: VRAM needed for the weights is approximately twice the parameter count in billions, expressed in gigabytes. A 7-to-8 billion parameter model needs on the order of 16 GB just for weights; a 70 billion parameter model needs around 140 GB, which means multiple GPUs.
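
The weights rule, and the key-value cache that has to share VRAM with the weights, can be sketched as back-of-the-envelope arithmetic. This is a rough estimate under stated assumptions, not a sizing tool: the function names are ours, the layer and head counts are illustrative values for an 8B-class model rather than a specification of any particular release, and the cache figure is a worst case in which every request fills its whole context. It ignores activations, framework overhead, and memory fragmentation.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: int = 16) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # One billion parameters at 16 bits (2 bytes) each is 2 GB.
    return params_billions * bits_per_weight / 8


def kv_cache_gb(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    concurrent_requests: int,
    bytes_per_value: int = 2,
) -> float:
    """Approximate worst-case KV cache size in GB for standard attention.

    Every token stores one key vector and one value vector (the factor
    of 2) for each layer and each KV head.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * concurrent_requests / 1e9


if __name__ == "__main__":
    for size in (8, 70):
        print(f"{size}B weights at 16-bit: ~{weight_vram_gb(size):.0f} GB")

    # Illustrative (assumed) shape for an 8B-class model with grouped-query
    # attention, serving 32 requests that each use an 8,192-token context.
    cache = kv_cache_gb(
        layers=32,
        kv_heads=8,
        head_dim=128,
        context_tokens=8192,
        concurrent_requests=32,
    )
    print(f"Worst-case KV cache: ~{cache:.1f} GB")
```

With these assumed settings the worst-case cache comes to roughly 34 GB, about twice the 16 GB of weights, which is why concurrency and context length belong in the sizing estimate from the start.
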
Then you must add headroom for the KV cache, which at high concurrency and long context can rival the weights in size. Under-provisioning the KV cache is a common cause of "it worked in testing and fell over in production."

**Quantization** is the main lever for fitting larger models onto smaller hardware. By storing weights at lower numerical precision (8-bit, or 4-bit formats such as those used by GPTQ and AWQ), you cut the memory footprint substantially, often roughly halving it at 8-bit and quartering it at 4-bit relative to 16-bit. That can move a model from "needs two GPUs" to "fits on one," with a corresponding drop in cost.

The trade-off is quality: aggressive quantization can degrade accuracy, and the degree varies by model and task. Eight-bit quantization is usually close to lossless for most workloads; four-bit is often acceptable and sometimes not, and the only honest way to know is to evaluate on your own task rather than trusting a generic benchmark. We treat quantization as something to validate per use case, not assume, which is part of how we approach [Machine Learning Model Development](/services/machine-learning-model-development).

The practical sizing sequence we use: pick the smallest model that passes your quality bar on a representative evaluation set, then choose quantization that preserves that bar, then size the GPU for the resulting memory footprint plus realistic concurrent-load KV cache, then load-test before committing to capacity.

## Security and Isolation

Self-hosting changes your security posture rather than simply improving it. You remove the third-party-access risk, and you take on the full responsibility for securing the deployment yourself, which is not free.
The core controls for a private deployment: the model and its endpoint live inside a private network segment with no public ingress, reachable only through your own authenticated application layer; all traffic is encrypted in transit and any stored data (logs, caches, fine-tuning datasets) is encrypted at rest; access is governed by role-based controls and least privilege, so the people who can reach the model and its training data are a deliberately small set; and every request and administrative action is logged for audit, which is frequently a compliance requirement in itself.

For genuinely air-gapped deployments, the discipline extends to the supply chain: model weights, container images, and dependencies must be brought across the boundary through a controlled process and verified, because there is no live connection to pull updates or patches. This is real operational weight, and it should be counted honestly in the decision. An air-gapped environment that nobody patches is not more secure, it is differently insecure.

The benefit of running inside your own boundary is real, but it is contingent on actually doing the work that the provider was previously doing for you.

## Fine-Tuning vs. RAG

A frequent assumption is that private hosting requires fine-tuning the model on your data. Usually it does not, and reaching for fine-tuning first is a common and expensive mistake.

**Retrieval-augmented generation (RAG)** keeps the model fixed and supplies relevant information at query time by retrieving from your own knowledge base and placing it in the prompt. It is the right default for the most common need, which is making a model answer accurately about your specific documents, policies, products, or records. RAG keeps knowledge current (you update the data, not the model), is far cheaper to maintain, and keeps your proprietary data in a retrieval store you control rather than baked into weights.
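
The retrieve-then-prompt flow described above can be sketched in a few lines. This is a toy illustration, not a production pipeline: the word-overlap `score` function stands in for the embedding search a real system would typically use, and `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names and data invented for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, passage: str) -> int:
    """Toy relevance score: how many distinct query words appear in the passage."""
    return len(tokenize(query) & tokenize(passage))


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str], k: int = 2) -> str:
    """Place the retrieved passages in the prompt; the model itself is unchanged."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base. Updating these strings changes what the model
# can answer about, with no retraining.
KNOWLEDGE_BASE = [
    "Refunds are issued within 14 days of a return request.",
    "The VPN must be used for all remote access to internal systems.",
    "Expense reports are due by the fifth business day of each month.",
]

if __name__ == "__main__":
    print(build_prompt("When are expense reports due?", KNOWLEDGE_BASE, k=1))
```

The point of the sketch is the division of labor: the knowledge lives in a store you control and can edit, and the model only ever sees the passages selected for the current question.
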
For most private-hosting use cases, a capable open model plus a well-built retrieval pipeline meets the requirement.

**Fine-tuning** changes the model's weights and earns its keep for a narrower set of goals: teaching a consistent output format or style, adapting to specialized domain language the base model handles poorly, or improving reliability on a narrow, high-volume task. It does not reliably "add knowledge" the way people expect, and a fine-tuned model still goes stale as your information changes.

The pragmatic path is RAG first, measure, and fine-tune only the specific gaps that retrieval cannot close. The two compose well: a fine-tuned model serving a RAG pipeline. This retrieval-first approach is central to how we build [Generative AI Solutions](/services/generative-ai-solutions) on private infrastructure.

## A Decision Framework

Putting it together, here is the sequence we apply when a team asks whether to self-host.

**Start with the constraints, not the technology.** Ask whether any of the five drivers applies with real force: a compliance or residency obligation that an enterprise API tier with a BAA cannot satisfy; IP sensitive enough that contractual non-training assurances are insufficient; sustained, high, predictable GPU utilization; a hard latency or offline requirement; or a need for frozen, version-controlled model behavior. If none applies strongly, stop, use a hosted API, and revisit later.

**If a driver applies, locate the lightest sufficient point on the spectrum.** A single-tenant hosted deployment or an enterprise API with a BAA may satisfy a compliance need without you running any GPUs. Self-hosting in your own VPC is the next step up. Full air-gap is the heaviest and should be reserved for requirements that genuinely demand it. Do not skip to the most isolated option because it feels safest; each step adds permanent operational cost.

**Then size honestly.** Estimate real average utilization, not peak.
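
To see why average utilization matters more than peak, here is a rough sketch of the effective cost per million tokens on a rented GPU. Every number in it is a placeholder assumption: the hourly rate sits in the one-to-two-dollar range mentioned earlier, the peak throughput of 1,000 tokens per second is invented for illustration, and the function name is ours. It also leaves out staffing and operations cost entirely.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def cost_per_million_tokens(
    hourly_rate_usd: float,
    peak_tokens_per_second: float,
    avg_utilization: float,
) -> float:
    """Effective GPU cost per million tokens actually served.

    The GPU is billed for every hour, but only the utilized fraction of
    its peak throughput produces tokens anyone asked for.
    """
    if not 0 < avg_utilization <= 1:
        raise ValueError("avg_utilization must be in the range (0, 1]")
    monthly_cost = hourly_rate_usd * HOURS_PER_MONTH
    tokens_served = (
        peak_tokens_per_second * 3600 * HOURS_PER_MONTH * avg_utilization
    )
    return monthly_cost / (tokens_served / 1e6)


if __name__ == "__main__":
    # Assumed figures: $2/hour GPU, 1,000 tokens/s at full batching.
    for utilization in (1.0, 0.5, 0.15):
        cost = cost_per_million_tokens(2.0, 1000, utilization)
        print(f"{utilization:.0%} utilization: ${cost:.2f} per million tokens")
```

Under these assumptions the cost per token at fifteen percent utilization is between six and seven times the cost at full utilization, and that higher figure is the one to set against an API's per-token price.
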
Run the cost comparison against your actual projected API spend. Pick the smallest model that passes a real evaluation on your task, validate quantization rather than assuming it, and load-test the serving stack before committing capacity. Account for the people-cost of operating GPUs, patching, and on-call, which is the line item most often left out.

**Be honest about the burden.** Self-hosting means you own uptime, scaling, security patching, model updates, and the team to do all of it. For organizations without existing MLOps capability, that burden is frequently larger than the API bill it was meant to replace. This is the single most common reason a self-hosting project disappoints: the infrastructure works, but the total cost of ownership was underestimated.

A worked example of the framework in practice: in regulated healthcare, the compliance and IP drivers are usually decisive and the latency case is often strong too, which is why a clinical tool like our [AI Clinical Assistant](/products/ai-clinical-assistant) is designed to run within a controlled environment where patient data never leaves the boundary. The driver there is not cost, it is that the constraint leaves no acceptable alternative. By contrast, a consumer support chatbot with spiky traffic and no special data sensitivity almost always belongs on a hosted API, and building private infrastructure for it would be a clear over-engineering error.

## Conclusion

The right answer is workload-specific, and the discipline is to choose based on your actual constraints rather than on a general preference for control or a general fear of cost. Hosted APIs are the correct default. Private and air-gapped hosting earns its place when compliance, IP, sustained volume, latency, or version control make it the better engineering decision, and the value is real when those drivers are real.
If you are weighing this for a specific workload, our [Private LLM Hosting & Deployment](/services/private-llm-hosting) team can help you locate it on the spectrum and size it without over-building. The clearest place to start is a short, no-obligation audit of your use case and constraints at [/demo](/demo), where the goal is a straight answer about whether self-hosting is worth it for you, including the cases where it is not.