Cloud GPU pricing looks straightforward until you actually run a training job at scale. The real cost of AI training on AWS, Azure, or GCP includes egress fees, idle billing, checkpoint storage, and reserved instance commitments that lock you in before you understand your actual utilization pattern. For most teams running sustained training workloads, GPU colocation is 40–50% cheaper once you account for the full picture.
What Does a Cloud GPU Instance Actually Cost?
The headline number is never the real number. AWS lists the p4d.24xlarge — 8x A100s — at roughly $32/hour on-demand. That sounds manageable until you do the math at scale.
A single serious training run on a large language model might take 3–6 weeks of continuous compute. At $32/hour, that's roughly $16,000–$32,000 for one run on one instance (504–1,008 hours). Now add a team doing iterative training, hyperparameter sweeps, and parallel experiments across multiple instances. You're looking at $150K–$300K per quarter before you've touched storage or egress.
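The per-run figure is just hours times rate, so it's easy to rerun with your own numbers. Here's a minimal sketch; the `run_cost` helper and the four-instance scenario are illustrative assumptions, and the $32/hour rate is the approximate on-demand figure quoted above.

```python
HOURS_PER_WEEK = 24 * 7  # continuous compute, no pauses


def run_cost(weeks, hourly_rate, instances=1):
    """On-demand compute cost in dollars for a continuous training run."""
    return weeks * HOURS_PER_WEEK * hourly_rate * instances


RATE = 32.0  # $/hour, approximate p4d.24xlarge on-demand rate from the text

for weeks in (3, 6):
    print(f"{weeks} weeks on one instance: ${run_cost(weeks, RATE):,.0f}")

# Hypothetical team scenario: a 3-week sweep across four instances
print(f"3 weeks on four instances: ${run_cost(3, RATE, instances=4):,.0f}")
```

Storage, egress, and idle time are deliberately left out here; this is the compute line only.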
Reserved instances cut the hourly rate — sometimes by 30–40% — but they require 1- or 3-year commitments. You're betting on your utilization pattern before you know what it is. If your training workload shifts or your architecture changes (and it will), you're either paying for idle capacity or eating the break fees.
The Costs That Don't Show Up in the GPU Line Item
Here's where finance teams get surprised:
Egress fees. AWS charges $0.09/GB to move data out. Azure is similar. Move 10TB out of the cloud once, whether that's a dataset copy or checkpoints pulled back to your own storage, and that's $900 in egress. Teams doing active development (pulling model weights, evaluation outputs, logs) can hit $5,000–$15,000/month in egress alone without realizing it.
Checkpoint storage. A modern LLM checkpoint can run 50–200GB. Save one every few hours during a multi-week run and you're storing terabytes of checkpoint data at $0.10/GB/month or more. That's $500–$2,000/month just for training artifacts, on top of your dataset storage.
Idle billing. GPU instances bill for every hour they are up, whether your job is running or not. Queue failures, environment setup, debugging sessions, and job restarts all burn compute time. On a busy cluster, 15–20% idle time is common. You're paying for GPUs that are doing nothing.
Cross-zone transfer. Distributed training across availability zones isn't free. If your training cluster spans zones (which AWS recommends for resilience), you're paying data transfer costs between nodes in addition to the compute bill.
Add it up and the real cost of cloud GPU training is often 1.5–2x the instance price alone.
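To see how these line items stack up against your own compute bill, here's a rough sketch. The `hidden_costs` function is hypothetical, the rates are the ones quoted above, and the example inputs (one instance at $32/hour for an assumed 730-hour month, 10TB of egress, 5TB of checkpoints, 15% idle) are illustrative only. Idle time is modeled as extra billed compute on top of productive hours, which is a simplifying assumption.

```python
def hidden_costs(compute_monthly, egress_gb, checkpoint_gb, idle_fraction,
                 egress_rate=0.09, storage_rate=0.10):
    """Itemize monthly costs that sit outside the GPU line item.

    egress_rate is $/GB transferred out; storage_rate is $/GB/month.
    """
    items = {
        "egress": egress_gb * egress_rate,
        "checkpoint_storage": checkpoint_gb * storage_rate,
        "idle": compute_monthly * idle_fraction,
    }
    items["hidden_total"] = sum(items.values())
    # Ratio of real cost to the instance price alone
    items["multiplier"] = (compute_monthly + items["hidden_total"]) / compute_monthly
    return items


# Illustrative inputs: one instance, 730 hours/month at $32/hour
example = hidden_costs(
    compute_monthly=730 * 32.0,
    egress_gb=10_000,      # 10TB, decimal
    checkpoint_gb=5_000,   # 5TB of stored checkpoints
    idle_fraction=0.15,
)

for name, value in example.items():
    print(f"{name}: {value:,.2f}")
```

The multiplier depends heavily on how much data you move, so plug in your actual egress volume before drawing conclusions.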
What Does the Same Workload Cost in Colocation?
Let me give you a concrete scenario. Say you're running a 4-cabinet H100 cluster: 32 DGX H100 systems (256 GPUs) in total, provisioned for around 480kW of IT load at 120kW per cabinet.
| Cost Component | Cloud (AWS p4d equivalent) | GPU Colocation (IDACORE East) |
|---|---|---|
| Compute/power (monthly) | ~$200,000–$240,000 | ~$120,000 ($250/kW × 480kW) |
| Egress (10TB/month) | ~$900–$1,500 | $0 (your hardware, your network) |
| Checkpoint storage (5TB) | ~$500–$1,000/month | Included in your own storage |
| Idle billing (15%) | ~$30,000–$36,000 | $0 (you own the hardware) |
| Estimated monthly total | ~$231,000–$278,000 | ~$120,000 |
That's not a rounding error. Annualized, it's roughly $2.8M–$3.3M in cloud versus about $1.44M in colocation, a gap of $1.3M–$1.9M per year. The colo figure is the power and facility fee from the table, which is the whole monthly bill if you own your GPUs outright. If you're buying hardware, amortize it over 3–4 years and add it on top.
The cloud number also assumes you're getting reserved pricing. On-demand, the compute line runs roughly 43–67% higher, since reserved rates are already 30–40% below on-demand.
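If you want to check the table or swap in your own figures, the arithmetic is short. This sketch uses only the numbers from the table above; the dictionary layout and the `monthly_total` helper are my own illustrative choices.

```python
# Monthly line items from the comparison table, in dollars
CLOUD_LOW = {"compute": 200_000, "egress": 900,
             "checkpoint_storage": 500, "idle": 30_000}
CLOUD_HIGH = {"compute": 240_000, "egress": 1_500,
              "checkpoint_storage": 1_000, "idle": 36_000}

COLO_MONTHLY = 250 * 480  # $250/kW/month x 480kW of IT load


def monthly_total(items):
    """Sum the monthly line items for one scenario."""
    return sum(items.values())


for label, items in (("low", CLOUD_LOW), ("high", CLOUD_HIGH)):
    cloud = monthly_total(items)
    saving = 1 - COLO_MONTHLY / cloud
    print(f"cloud ({label}): ${cloud:,}/month, ${cloud * 12:,}/year")
    print(f"  colo: ${COLO_MONTHLY:,}/month, ${COLO_MONTHLY * 12:,}/year")
    print(f"  annual gap: ${(cloud - COLO_MONTHLY) * 12:,} ({saving:.0%} lower)")
```

Hardware purchase cost is not in either column here, so add your own amortization if you're buying GPUs.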
Why Traditional Data Centers Can't Host This Hardware
Before you start calling data centers, understand the density problem. A single DGX H100 system draws about 10.2kW. A standard rack of 8 DGX H100s hits 80–100kW depending on configuration. Most traditional colocation facilities are built for 5–15kW per cabinet — the kind of density that works fine for web servers and databases.
Put an H100 cluster in a standard colo and you'll either trip breakers or get told the rack can't be deployed. This isn't a theoretical concern. It's why so many teams end up defaulting to cloud — not because cloud is cheaper, but because finding a facility that can actually host the hardware is hard.
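The density mismatch is easy to quantify. Here's a minimal sketch, assuming the 10.2kW per-system figure above and counting power only; the helper names are hypothetical, and real deployments also need headroom for networking, storage, and cooling.

```python
import math

DGX_H100_KW = 10.2  # approximate draw per DGX H100 system, from the text


def systems_per_cabinet(cabinet_kw, system_kw=DGX_H100_KW):
    """How many whole systems fit within a cabinet's power budget."""
    return math.floor(cabinet_kw / system_kw)


def rack_load_kw(systems, system_kw=DGX_H100_KW):
    """Total power draw for a given number of systems."""
    return systems * system_kw


for cabinet_kw in (5, 15, 120):
    fit = systems_per_cabinet(cabinet_kw)
    print(f"{cabinet_kw}kW cabinet: {fit} DGX H100 system(s)")

print(f"8 systems in one rack: {rack_load_kw(8):.1f}kW")
```

At the 5–15kW end, a cabinet holds one system at best, which is why an 8-system rack is a non-starter in a traditional facility.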
IDACORE East in Eastern Oregon is built specifically for this density profile: 120kW per cabinet, direct-to-chip liquid cooling, and true 2N power (an independent grid source plus gas generation, both running simultaneously, rather than a generator that kicks in when utility fails). The facility targets a PUE of around 1.10, which matters because at 480kW of IT load, each 0.1 of PUE is another 48kW of overhead running around the clock.
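To put a number on the PUE point: overhead power is IT load times (PUE minus 1). The sketch below compares the 1.10 target against a hypothetical 1.20 facility. The electricity price is a placeholder assumption, not a quoted rate, so substitute your own.

```python
HOURS_PER_YEAR = 24 * 365


def overhead_kw(it_load_kw, pue):
    """Non-IT power (cooling, distribution losses) implied by a PUE."""
    return it_load_kw * (pue - 1.0)


def annual_overhead_cost(it_load_kw, pue, price_per_kwh):
    """Yearly cost of the overhead power at a flat electricity price."""
    return overhead_kw(it_load_kw, pue) * HOURS_PER_YEAR * price_per_kwh


IT_LOAD_KW = 480
PRICE_PER_KWH = 0.08  # hypothetical $/kWh, for illustration only

extra_kw = overhead_kw(IT_LOAD_KW, 1.20) - overhead_kw(IT_LOAD_KW, 1.10)
extra_cost = (annual_overhead_cost(IT_LOAD_KW, 1.20, PRICE_PER_KWH)
              - annual_overhead_cost(IT_LOAD_KW, 1.10, PRICE_PER_KWH))

print(f"Extra overhead at PUE 1.20 vs 1.10: {extra_kw:.0f}kW")
print(f"Extra annual cost at ${PRICE_PER_KWH}/kWh: ${extra_cost:,.0f}")
```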
What 2N Power Actually Means for Training Jobs
A training run that gets interrupted at hour 400 of a 500-hour job doesn't simply resume where it stopped. You lose everything since the last good checkpoint, plus the time to restart: depending on your checkpoint frequency, that can be 6–12 hours of compute. At $32/hour per instance, that's $192–$384 per instance, which becomes a five-figure loss from a single power event once the cluster reaches a few dozen instances.
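Here's a quick way to size that risk for your own cluster. This is a sketch under simple assumptions: every instance loses the same number of hours, and the only cost counted is re-bought compute at the $32/hour rate used above. Both function names are hypothetical.

```python
import math


def interruption_cost(lost_hours, hourly_rate, instances):
    """Compute dollars lost when every instance repeats lost_hours of work."""
    return lost_hours * hourly_rate * instances


def instances_for_loss(target_loss, lost_hours, hourly_rate):
    """Smallest cluster size at which one interruption costs target_loss."""
    return math.ceil(target_loss / (lost_hours * hourly_rate))


RATE = 32.0  # $/hour per instance

for lost_hours in (6, 12):
    cost = interruption_cost(lost_hours, RATE, instances=32)
    threshold = instances_for_loss(10_000, lost_hours, RATE)
    print(f"{lost_hours}h lost on 32 instances: ${cost:,.0f}")
    print(f"  five-figure loss starts at {threshold} instances")
```

Engineer time spent diagnosing and restarting the job is not counted, so treat the result as a floor.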
"Generator backup" means you have utility power and a generator that starts when utility fails. There's a transfer window — usually 10–30 seconds — where nothing is running. For most workloads, that's fine. For a GPU training job, that's a crash.
True 2N means two independent power paths, both live simultaneously. No transfer window. No single point of failure. IDACORE East runs independent grid source plus gas generation as co-primary sources, not primary-plus-backup. That distinction matters when you're pricing the cost of a failed training run.
When Does Cloud GPU Still Make Sense?
I'm not going to tell you cloud is always wrong. It isn't.
Cloud GPU makes sense when you're in early experimentation — running jobs that last hours or days, not weeks. When your utilization is unpredictable and you genuinely can't forecast whether you'll need 10 GPUs or 100 next month. When you're a team of three and you don't want to manage hardware procurement and colocation contracts.
The math flips toward colocation when utilization goes above 60–70% consistently, when your training runs are measured in weeks, and when you're moving significant data volumes. At that point, you're paying the cloud premium for flexibility you're not actually using.
Most teams doing serious AI development hit that threshold faster than they expect. The experimentation phase ends, the architecture stabilizes, and suddenly you're running the same training pipeline month after month on cloud hardware you don't own, paying egress fees on data you're generating, and wondering why the GPU line item in your cloud bill keeps growing.
Frequently Asked Questions
How much does it cost to train AI models on cloud GPUs vs colocation?
Cloud GPU instances like AWS p4d.24xlarge run $32–$40/hour on-demand; for the cluster in the comparison above, that works out to roughly $200K–$240K/month in reserved compute. GPU colocation at facilities like IDACORE East runs around $250/kW/month all-in. A 4-cabinet H100 cluster drawing 480kW costs about $120K/month in colo, often 40–50% less than equivalent reserved cloud capacity once you account for egress, storage, and idle billing.
What hidden costs should I expect from cloud GPU training?
The big ones are egress fees ($0.08–$0.09/GB on AWS and Azure), idle GPU billing when jobs queue or fail, cross-zone data transfer for distributed training, and snapshot/checkpoint storage at $0.10/GB/month or more. A model with 100GB checkpoints saved every few hours can generate thousands of dollars monthly in storage costs alone, none of which shows up in your GPU line item.
Is GPU colocation right for AI training workloads?
It's the right fit when your GPU utilization is consistently above 60–70%, your training runs are measured in weeks or months rather than hours, and you're moving large datasets in and out regularly. At that point, the fixed monthly cost of colocation almost always beats variable cloud pricing. It's less ideal for sporadic experimentation where you need GPUs for a few days and then nothing.
What power density do H100 GPU clusters actually require?
A single DGX H100 system draws around 10.2kW. A rack of 8 DGX H100s hits roughly 80–100kW depending on configuration and cooling approach. Most traditional data centers are built for 5–15kW per cabinet, which means they physically can't host modern GPU clusters. You need a facility built for 80–120kW/cabinet with direct-to-chip liquid cooling, which is exactly what IDACORE East is designed for.
How does data residency affect AI training infrastructure decisions?
If you're training on healthcare records, financial data, or government datasets, you likely have contractual or regulatory requirements about where that data lives and whether it crosses state lines. Cloud deployments can replicate or route data across regions depending on how each service is configured, and the residency controls are complex to configure and audit. Colocation in a single facility, like IDACORE East in Eastern Oregon, gives you physical data residency that's straightforward to document for compliance purposes.
If you're running sustained AI training workloads and your cloud bill keeps climbing, the numbers above are worth running against your own utilization data. IDACORE East is purpose-built for exactly this density profile — 120kW/cabinet, true 2N power, liquid cooling, and $250/kW/month all-in with no egress surprises. We're currently pre-leasing Phase 1 capacity via LOI. Talk to us about your training infrastructure requirements before you sign another reserved instance commitment.