Prefetch vs VRAM: Factory Simulator

Think of training as a conveyor-belt factory. The loader produces bytes (MB/s) into pinned host RAM, the H→D transfer moves them (MB/s) into VRAM, and compute drains them (MB/s). Tune the rates and capacities to see which stage blocks first and where OOM risk appears.
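The rate-and-capacity model above can be sketched as a tiny discrete-time simulation. This is a minimal illustration, not the simulator's actual code; the rates, capacities, and tick granularity are hypothetical (MB per tick, MB of queue capacity).

```python
# Minimal discrete-time sketch of the three-stage pipeline.
# All rates/capacities are hypothetical: MB per tick and MB of capacity.
def simulate(loader_rate, xfer_rate, compute_rate,
             pinned_cap, vram_cap, ticks=100):
    pinned = vram = 0.0                      # queue occupancies in MB
    produced = transferred = consumed = 0.0
    for _ in range(ticks):
        # 3. Compute drains VRAM-resident bytes first.
        c = min(compute_rate, vram)
        vram -= c; consumed += c
        # 2. H->D copy is limited by pinned contents AND free VRAM.
        x = min(xfer_rate, pinned, vram_cap - vram)
        pinned -= x; vram += x; transferred += x
        # 1. Loader blocks when pinned host RAM is full.
        p = min(loader_rate, pinned_cap - pinned)
        pinned += p; produced += p
    return {"produced": produced, "transferred": transferred,
            "consumed": consumed, "pinned": pinned, "vram": vram}

# Slow compute: the VRAM backlog fills to capacity, then the copy and the
# loader stall behind it in turn.
stats = simulate(loader_rate=10, xfer_rate=10, compute_rate=4,
                 pinned_cap=20, vram_cap=16)
```

With these numbers the pipeline settles at the compute rate (4 MB/tick) while both queues sit pinned at their capacities, which is exactly the "which part blocks first" behavior the simulator visualizes.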

Pipeline State

1. DataLoader + Prefetch

Creates CPU-side batches. If the prefetch queue is full, the loader blocks. Prefetch consumes mostly pinned host RAM.
Prefetch queue occupancy
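The blocking behavior above is just a bounded queue. A minimal sketch with Python's standard library (the queue depth of 2 is a hypothetical stand-in for a pinned-RAM budget):

```python
import queue

# Hypothetical prefetch queue: bounded by pinned host RAM, here 2 batches deep.
prefetch = queue.Queue(maxsize=2)
prefetch.put("batch-0")
prefetch.put("batch-1")          # queue is now full

# A real loader worker would block inside put(); put_nowait() surfaces the
# back-pressure as an exception instead of hanging.
try:
    prefetch.put_nowait("batch-2")
    blocked = False
except queue.Full:
    blocked = True               # loader stalls until the GPU side drains a batch

print(blocked)  # → True
```

The point of the bound is that back-pressure propagates upstream: the loader can never outrun VRAM by more than the queue's capacity in pinned host RAM.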

2. H→D Transfer

Moves a batch from pinned host RAM to VRAM. A fast copy helps, but it cannot overcome queue bounds or tensor-lifetime issues.
VRAM backlog waiting for compute

3. GPU Compute

Consumes VRAM-resident bytes. Slow compute or a large backlog capacity holds tensors in VRAM longer and pushes usage toward OOM.
VRAM backlog occupancy
Pinned host RAM usage
VRAM usage
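The "leading bottleneck" readout follows from the three stage rates: with bounded queues, the sustained end-to-end rate converges to the slowest stage. A one-liner sketch (the rates are hypothetical MB/s, not values from the simulator):

```python
# The slowest stage sets the sustained pipeline rate once queues fill.
rates = {"loader": 120.0, "h2d": 90.0, "compute": 60.0}
bottleneck = min(rates, key=rates.get)
sustained = rates[bottleneck]
print(bottleneck, sustained)  # → compute 60.0
```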
Live readouts: current diagnosis, packet animations for the loader, transfer, and compute stages, produced/transferred/consumed throughput, and the leading bottleneck.