<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Next-Generation GPU Blog]]></title><description><![CDATA[Next-Generation GPU Blog]]></description><link>https://nextgengpu.space</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 08:34:20 GMT</lastBuildDate><atom:link href="https://nextgengpu.space/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[GPU utilization troubleshooting: the 8 bottlenecks that waste GPU hours]]></title><description><![CDATA[GPU utilization troubleshooting starts with the assumption that the GPU isn't the bottleneck in your training. A recent study on PyTorch data loading showed that default preprocessing can cause up to ]]></description><link>https://nextgengpu.space/diagnose-low-gpu-utilization-ai-training</link><guid isPermaLink="true">https://nextgengpu.space/diagnose-low-gpu-utilization-ai-training</guid><dc:creator><![CDATA[Daya Shankar]]></dc:creator><pubDate>Wed, 18 Mar 2026 07:46:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b7e578a9f27706833e854c/bb815e66-a09b-4adc-90ea-e81dede2a92d.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>GPU utilization troubleshooting starts with the assumption that the GPU isn't the bottleneck in your training. A recent study on PyTorch data loading showed that default preprocessing can cause up to 76% GPU idleness. In their benchmarks, average GPU utilization was just 46.4% with PyTorch’s DataLoader. It improved to 90.45% after fixing the data prep pipeline.</p>
<p>If you want to stop paying for wasted GPU hours, it helps to think in “step time” slices. In practice, each step is mostly data load, host-to-device transfer, compute and communication. You should fix the largest slice first because it sets the ceiling for everything else.</p>
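<p>The triage logic above can be sketched in a few lines of plain Python. The slice names and timings below are illustrative stand-ins for numbers you would pull from a real trace, not output from any specific profiler:</p>

```python
# Sketch: pick the dominant step-time slice to fix first.
# Slice names and timings are illustrative, not from a real profiler.

def dominant_slice(step_slices):
    """Return the (name, seconds) pair with the largest share of step time."""
    return max(step_slices.items(), key=lambda kv: kv[1])

# Example: per-step averages in seconds from a hypothetical trace.
slices = {"data_load": 0.120, "h2d_transfer": 0.015,
          "compute": 0.080, "communication": 0.030}
name, seconds = dominant_slice(slices)
total = sum(slices.values())
print(f"fix '{name}' first: {seconds / total:.0%} of step time")
```

<p>In this hypothetical trace, data loading takes roughly half of every step, so it sets the ceiling: no compute optimization can pay off until that slice shrinks.</p>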
<p>Before you touch anything, capture a reproducible 10-minute baseline trace from a stable run. Include utilization, step time, dataloader time, CPU utilization, <a href="https://acecloud.ai/blog/why-gpu-memory-matters-more-than-you-think/">GPU memory usage</a> and throughput. In addition, write down the exact command, commit, container tag and dataset snapshot. That repeatability lets you trust your results and spot regressions quickly.</p>
<h2>1. Storage and I/O Bottlenecks</h2>
<p>When storage is the limiter, GPU utilization often “sawtooths” with idle gaps between bursts of work.</p>
<p>You will usually see dataloader time stay high near batch boundaries while step time includes obvious waiting. A practical check is to compare data_time versus compute_time per step using framework timers or torch.profiler. Next, rerun briefly from local NVMe or a warm cache because a big throughput jump points to I/O sensitivity.</p>
<p>A good first move is to keep hot datasets close to the GPUs because shorter latency reduces batch wait time. After that, consider sharding many small files into tar or record formats because fewer filesystem opens reduce metadata overhead. You can also increase prefetch depth because queued batches reduce stalls at step boundaries.</p>
<h2>2. CPU and DataLoader Bottlenecks</h2>
<p>When CPU work is the limiter, the GPU can sit mostly idle while your CPU cores are fully pegged.</p>

<p>Profiling usually shows decode, tokenize or augmentation dominating each step while the GPU waits. As a quick sanity test, temporarily increase num_workers and watch both throughput and p95 step time. Then profile CPU hotspots with py-spy, perf or cProfile because the top functions usually reveal the expensive preprocessing stage.</p>
<p>The most reliable improvement is to move heavy preprocessing offline because one-time transforms eliminate repeated per-step CPU work. You should also prefer vectorized transforms because compiled kernels reduce Python overhead.</p>
<p>Persistent workers help as well because process churn adds avoidable latency. Finally, remove per-sample Python loops where you can because interpreter overhead scales poorly at high sample rates.</p>
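<p>The "move preprocessing offline" idea can be sketched with a one-time cache pass. The tokenizer here is a cheap stand-in for any expensive per-sample transform, not a real library call:</p>

```python
# Sketch: move a per-sample transform offline by caching results once,
# so the per-step path is a dict lookup instead of repeated CPU work.
# slow_tokenize is a stand-in, not a real tokenizer API.

def slow_tokenize(text):
    # Stand-in for an expensive per-sample transform.
    return [ord(ch) % 256 for ch in text.lower()]

corpus = ["GPU idle", "fix the input pipeline"]

# One-time offline pass: pay the CPU cost once, store the result.
cache = {i: slow_tokenize(t) for i, t in enumerate(corpus)}

# Per-step path is now a cheap lookup, no interpreter-heavy loop.
def get_sample(i):
    return cache[i]
```

<p>In a real pipeline the cache would be materialized to disk in a sharded record format rather than held in a dict, but the economics are the same: one pass of CPU work amortized over every epoch.</p>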
<h2>3. Host-to-Device Transfer Bottlenecks</h2>
<p>Sometimes reads look fast and CPU looks fine, yet you still lose time copying data onto the GPU.</p>
<p>In a timeline from Nsight Systems or torch.profiler, large host-to-device copy regions stand out, and overlap with compute looks weak. Measure host-to-device copy time per step, then confirm whether copies overlap compute. Compare pinned versus non-pinned memory because pinned memory enables faster DMA transfers and better async behavior.</p>
<p>Pinned memory is usually the first win because it reduces transfer overhead and supports asynchronous copies. You can also batch transfers to avoid many small copies because per-copy overhead becomes significant. Prefetching the next batch often helps because overlapping transfer with compute hides transfer latency.</p>
<h2>4. Batch Size Too Small</h2>
<p>Low utilization can also happen when the workload is simply too tiny per step.</p>
<p>In that case, kernels look small, and overhead dominates runtime even though nothing is obviously “slow.” Sweep batch sizes and track throughput in samples per second or tokens per second, plus p95 step time. Keep in mind that utilization can mislead you because frequent small kernels can keep the GPU busy without delivering proportional throughput.</p>
<p>If training stability allows it, increase batch size because more work per step amortizes launch and framework overhead. When memory is tight, gradient accumulation is often the safer path because it increases effective batch size without increasing activation memory. For variable-length inputs, bucketing usually helps because less padding increases useful work per step.</p>
<h2>5. Mixed Precision and Tensor Core Underuse</h2>
<p>You can see “okay” utilization and still get disappointing throughput when Tensor Cores are not doing most of the heavy lifting.</p>
<p>This often shows up as lower throughput than peers and more memory pressure than expected for the same model class. Confirm AMP is enabled end to end, including forward pass, loss and optimizer behavior. Also look for silent FP32 fallbacks caused by dtype, shape or layout because one fallback matmul can dominate step time.</p>
<p>Use BF16 or FP16 where the model is stable because reduced precision improves throughput and reduces memory bandwidth pressure. Prefer Tensor Core friendly shapes because aligned dimensions and supported layouts increase kernel efficiency. You should also verify the largest matmuls or convolutions hit fast kernels because they usually dominate total compute time.</p>
<h2>6. Kernel Launch Overhead and Sync Points</h2>
<p>Even with fast kernels, you can bleed time when the host repeatedly micromanages the GPU.</p>
<p>Profilers will show many short kernels, frequent synchronizations and high host overhead tied to logging, metrics or Python control flow. Profile a single training step and count kernel launches with Nsight Systems or your framework profiler. Search for .item(), .cpu() and explicit synchronizations because these calls introduce barriers that break overlap.</p>
<p>Reducing sync points is usually the cleanest improvement because each barrier forces the GPU and CPU to wait on each other. Lower logging frequency as well because metrics collection often triggers synchronizations. Where practical, fuse ops because fewer launches reduce overhead and memory traffic. If your stack supports it, graph or compile modes can help because they reduce Python orchestration per step.</p>
<h2>7. Power, Thermals and Clock Throttling</h2>
<p>If performance drops without code changes, hardware limits are often involved.</p>
<p>Clocks may fluctuate or stay low, and results can vary across runs on the same workload. Monitor clocks, temperature, power draw and throttling indicators using nvidia-smi, DCGM or node telemetry. Compare behavior across nodes and containers because contention and power limits often differ by host configuration.</p>
<p>Set performance mode and validate power limits because conservative limits reduce sustained clocks. Make sure airflow and cooling are adequate because thermal constraints force downclocking. If you share nodes, look for noisy neighbors because competing workloads can steal power headroom, PCIe bandwidth or CPU cycles.</p>
<h2>8. Multi-GPU Communication and Network Bottlenecks</h2>
<p>Scaling problems often look like “slower training” even though individual GPUs are fine.</p>
<p>You will notice poor scaling from 1 to N GPUs, rising step time with more GPUs and ranks waiting in all-reduce or synchronization points. Measure collective time with profiler regions or NCCL traces such as NCCL_DEBUG=INFO. Compare single-node versus multi-node runs because cross-node latency and bandwidth often dominate collective time.</p>
<p>Topology and placement matter because correct GPU and NIC affinity reduces collective latency. You can also tune collective settings because defaults may not match your fabric and message sizes. Try to overlap communication with compute because pipelining reduces idle time between kernels. Reduce synchronization frequency when safe because fewer global barriers reduce rank waiting.</p>
<h2>Quick Wrap-up</h2>
<p>The goal is to turn troubleshooting into a habit you can repeat, not a one-off rescue mission.</p>
<p>Run a 10-minute baseline trace on your next training job, then tag the run with command, commit, image and dataset snapshot. Break step time into data load, transfer, compute and communication, then pick the biggest slice. Apply the first fix that matches what you see, then re-measure to confirm the change actually helped.</p>
<p>Once you find a winner, bake it into your defaults: caching patterns, dataloader templates, pinned-memory settings, AMP policies and NCCL tuning. Add dashboards for throughput, p95 step time and idle percentage, then set regression alerts. If you revisit this weekly, your GPU hours start buying progress again.</p>
]]></content:encoded></item><item><title><![CDATA[MIG vs full-GPU: when partitioning improves ROI for AI teams]]></title><description><![CDATA[For most AI teams in 2026, the real infrastructure question is no longer simply how to get more GPUs. It is how to extract more value from the GPUs already in the rack. That is why the debate around M]]></description><link>https://nextgengpu.space/mig-vs-full-gpu-roi-ai-teams</link><guid isPermaLink="true">https://nextgengpu.space/mig-vs-full-gpu-roi-ai-teams</guid><dc:creator><![CDATA[Daya Shankar]]></dc:creator><pubDate>Wed, 18 Mar 2026 07:40:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b7e578a9f27706833e854c/30e52122-0f74-4e99-84cf-9322a9fa6246.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For most AI teams in 2026, the real infrastructure question is no longer simply how to get more GPUs. It is how to extract more value from the GPUs already in the rack. That is why the debate around <strong>MIG vs full-GPU</strong> has become a finance question as much as an engineering one.</p>
<p>A full GPU can be the right answer for giant training jobs, latency-critical serving, and memory-hungry fine-tuning. But many production stacks do not look like that all day. They look like embeddings, rerankers, small and midsize LLM inference, retrieval pipelines, notebooks, feature generation, and experiments that spike for minutes, then idle.</p>
<p>In that environment, GPU partitioning can turn stranded capacity into usable revenue, faster iteration, and cleaner unit economics.</p>
<h2>Why Does MIG vs Full-GPU Matter More in 2026?</h2>
<p>The timing matters. Flexera <a href="https://www.flexera.com/about-us/press-center/new-flexera-report-finds-84-percent-of-organizations-struggle-to-manage-cloud-spend">reported in 2025</a> that 84 percent of organizations still struggle to manage cloud spend, cloud budgets exceed targets by 17 percent on average, and 33 percent now spend more than $12 million annually on public cloud. At the same time, Datadog found that GPU instances account for <a href="https://www.datadoghq.com/state-of-cloud-costs/">14 percent</a> of compute costs for organizations using them, up from 10 percent a year earlier, a 40 percent jump. When GPU spend rises that quickly, idle capacity stops being a technical annoyance and becomes a board-level ROI issue.</p>
<h2>What Does Multi-Instance GPU Change?</h2>
<p>So what changes with MIG, or Multi-Instance GPU? On NVIDIA Hopper systems, MIG partitions one physical GPU into isolated instances with dedicated memory, cache, compute cores, and memory bandwidth.</p>
<h3>How GPU partitioning works</h3>
<p>NVIDIA says <a href="https://www.nvidia.com/en-us/technologies/multi-instance-gpu/">MIG can expose</a> up to seven instances on a single GPU, support deterministic latency and throughput, and let inference, training, and HPC jobs run at the same time without the noisy-neighbor behavior common in simple time slicing. In plain language, MIG makes a shared GPU behave more like several smaller, predictable accelerators instead of one expensive asset that teams queue for and underfill.</p>
<h3>Why isolation matters for ROI</h3>
<p>That matters because isolated slices can reduce waste. Instead of assigning a whole GPU to a workload that only needs a fraction of the card, platform teams can provision smaller instances with more control. That can improve utilization, reduce wait times, and raise output per GPU purchased or rented.</p>
<h2>The 2025 Shift that Changed the Economics</h2>
<p>AI deployment patterns changed sharply in 2025. Databricks <a href="https://www.databricks.com/blog/state-ai-enterprise-adoption-growth-trends">reported</a> that organizations put 11 times more AI models into production year over year and improved deployment efficiency from a 16-to-1 experimental-to-production ratio to 5-to-1. It also found that 76 percent of organizations using LLMs choose open-source models, often smaller and more controllable than frontier closed systems.</p>
<p><a href="https://hai.stanford.edu/ai-index/2025-ai-index-report">Stanford HAI</a> added another crucial signal in its 2025 AI Index. The inference cost of a system performing at GPT-3.5 level fell more than 280-fold between November 2022 and October 2024, while hardware costs declined roughly 30 percent annually and energy efficiency improved 40 percent per year.</p>
<p>The implication is striking. Teams are deploying more models, serving more workloads, and increasingly using right-sized models. That combination makes fractional GPU allocation far more attractive than it looked in the era when every serious workload seemed to require a whole accelerator.</p>
<h2>Why Is ROI Still Hard for Many AI Teams?</h2>
<p>Still, lower model cost does not guarantee better business returns.</p>
<p><a href="https://www.deloitte.com/global/en/issues/generative-ai/ai-roi-the-paradox-of-rising-investment-and-elusive-returns.html">Deloitte</a> found in 2025 that 85 percent of organizations increased AI investment in the prior 12 months and 91 percent planned to increase it again, yet most respondents said a typical AI use case takes two to four years to produce satisfactory ROI. Only 6 percent reported payback in under a year.</p>
<p>McKinsey’s 2025 global survey tells a similar story from another angle. While 88 percent of respondents say their organizations use AI in at least one business function, nearly two-thirds have not yet begun scaling AI across the enterprise, and only 39 percent report any EBIT impact from AI.</p>
<p>In other words, adoption is broad, but monetization is still uneven. That is exactly where MIG can help. It does not make a weak use case strong, but it can reduce the infrastructure waste that stretches payback periods.</p>
<h2>When Does Multi-Instance GPU Improve ROI?</h2>
<p>The best case for <strong>MIG vs full-GPU</strong> is straightforward. If your workloads are bursty, modest in memory footprint, and numerous enough to fill slices but not entire cards, partitioning usually improves ROI.</p>
<h3>Best-fit workload types for MIG</h3>
<p>Think embedding models, reranking, small open-weight LLM endpoints, document intelligence, speech pipelines, computer vision microservices, internal copilots with moderate concurrency, and development environments. These jobs often need predictable latency and isolation more than they need every ounce of a full H100 or H200.</p>
<h3>Operational gains beyond utilization</h3>
<p>MIG lets platform teams right-size those services, raise average utilization, and reduce the familiar pattern where a team reserves a full GPU for a service that only uses a fraction of it. NVIDIA explicitly positions MIG for right-sized provisioning and higher data center utilization, which is why it fits so well with Kubernetes-based multi-tenant AI platforms and FinOps programs.</p>
<h2>The Hidden ROI Lever: Team Velocity</h2>
<p>MIG also improves organizational velocity, which is an underappreciated part of ROI. When researchers and product squads do not need to wait for a whole GPU, more people can test, validate, and ship on the same hardware pool.</p>
<h3>Faster access, faster production</h3>
<p>That matters because Databricks found a dramatic rise in production deployment, while McKinsey found that most companies are still stuck between pilots and scale. Faster access reduces queue time, shortens feedback loops, and can move a team from experimentation to production without adding more infrastructure.</p>
<h3>Why this matters to platform teams</h3>
<p>On paper, that looks like better utilization. In practice, it means a platform team can serve more internal customers and more production services before approving another GPU purchase.</p>
<h2>When Is Full-GPU Still the Better Choice?</h2>
<p>Full-GPU still wins in several important cases. Large-scale pretraining, heavy fine-tuning, high-throughput batch inference on large models, and latency-critical serving that already saturates memory bandwidth or compute are poor candidates for partitioning.</p>
<h3>Workloads that need the whole card</h3>
<p>If one workload can consume the full card, slicing it just adds operational complexity and potential performance tradeoffs. The same is true for jobs that depend on maximum HBM capacity, aggressive tensor throughput, or tightly tuned throughput per watt at full occupancy.</p>
<h3>Why full-GPU remains essential</h3>
<p>For those teams, the cleanest path is often one service or one job per full GPU, especially when the model is big enough that every partition boundary is a constraint rather than a benefit. NVIDIA’s own framing supports this distinction by positioning MIG as a way to right-size smaller workloads and run mixed jobs in parallel, not as a universal replacement for full-card allocation.</p>
<h2>MIG vs. Full-GPU: Your Practical Decision Framework</h2>
<p>A practical rule works better than ideology. Choose MIG when your bottleneck is allocation inefficiency. Choose full-GPU when your bottleneck is actual GPU saturation.</p>
<h3>Choose MIG if</h3>
<p>Your serving stack runs many small or midsize models, teams fight over access more than they fight over latency budgets, or utilization looks jagged and inconsistent. In those cases, partitioning is likely to improve ROI.</p>
<h3>Choose full-GPU if</h3>
<p>Your top services already fill memory, drive high sustained utilization, and are tuned around full-card performance. In those situations, keeping the GPU whole usually makes more sense.</p>
<h2>Conclusion</h2>
<p>The bottom line in MIG vs full-GPU is simple. Partitioning improves ROI when it turns idle fragments of expensive compute into isolated, billable, production-grade capacity.</p>
<p>In 2026, that is increasingly common because enterprises are deploying more AI models, using more open source and smaller models, and facing heavier scrutiny on cloud and GPU spend. Full GPUs remain essential for the biggest and most demanding jobs.</p>
<p>But for many AI teams, especially those focused on inference, shared platforms, and internal product velocity, MIG is not a compromise. It is the mechanism that aligns AI infrastructure with the economics of real-world demand.</p>
]]></content:encoded></item></channel></rss>