Prefetch vs VRAM: Factory Simulator
Think of training like a conveyor-belt factory: the loader produces batches (MB/s) into pinned host RAM, the H→D transfer moves them (MB/s) into VRAM, and compute drains them (MB/s). Tune the rates and buffer capacities to see which stage blocks first and where OOM risk appears.
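The factory analogy can be sketched as a tiny discrete-time simulation. All rates, capacities, and numbers below are hypothetical, chosen only to illustrate how the slowest stage wins:

```python
def simulate(load_mbps, copy_mbps, compute_mbps,
             host_cap_mb, vram_cap_mb, seconds=10):
    """One-second ticks; rates in MB/s, capacities in MB (all hypothetical)."""
    host = vram = 0.0
    produced = transferred = consumed = 0.0
    for _ in range(seconds):
        # Loader fills pinned host RAM, blocking at capacity.
        fill = min(load_mbps, host_cap_mb - host)
        host += fill; produced += fill
        # H->D copy drains host RAM into VRAM, blocking at VRAM capacity.
        copy = min(copy_mbps, host, vram_cap_mb - vram)
        host -= copy; vram += copy; transferred += copy
        # Compute drains VRAM-resident bytes.
        done = min(compute_mbps, vram)
        vram -= done; consumed += done
    totals = {"loader": produced, "transfer": transferred, "compute": consumed}
    return min(totals, key=totals.get), host, vram

# Loader 400 MB/s, copy 800 MB/s, compute 250 MB/s: compute is the bottleneck,
# and the VRAM backlog grows by 150 MB per tick.
bottleneck, host, vram = simulate(400, 800, 250, host_cap_mb=2000, vram_cap_mb=8000)
```

Note the backlog grows at the *difference* between transfer and compute rates, which is exactly the failure mode the simulator visualizes.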
Pipeline State
1. DataLoader + Prefetch
Creates CPU-side batches. If the prefetch queue is full, the loader blocks. Prefetched batches mostly consume pinned host RAM.
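The "loader blocks when the queue is full" behavior is just a bounded queue; in PyTorch the analogous knobs are `DataLoader(num_workers=..., prefetch_factor=..., pin_memory=True)`. A minimal stand-in (the prefetch depth of 4 is a hypothetical choice):

```python
import queue

# Hypothetical prefetch depth of 4 batches; each slot pins host RAM.
prefetch = queue.Queue(maxsize=4)

loaded = blocked = 0
for batch in range(10):              # loader tries to stage 10 batches
    try:
        prefetch.put_nowait(batch)   # a real worker would block on put()
        loaded += 1
    except queue.Full:
        blocked += 1                 # queue full -> loader stalls, not OOM
```

With nothing consuming the queue, only 4 batches land and the other 6 attempts stall, which is why prefetch bounds host-RAM use rather than growing it without limit.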
2. H→D Transfer
Moves a batch from pinned host RAM to VRAM. A fast copy helps, but it cannot outrun the queue bounds or tensor-lifetime limits around it.
VRAM backlog: transferred batches waiting for compute.
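Why a faster copy "cannot beat queue bounds" falls out of a one-line model: the bytes moved per tick are capped by bandwidth, but also by what is staged upstream and by free VRAM downstream (function name and numbers are illustrative):

```python
def hd_copy_step(bw_mb, host_backlog_mb, vram_free_mb):
    """MB moved this tick: bandwidth-capped, but also bounded by what is
    staged in pinned RAM and by free VRAM (the downstream queue bound)."""
    return min(bw_mb, host_backlog_mb, vram_free_mb)

# Doubling copy bandwidth changes nothing when only 300 MB is staged:
assert hd_copy_step(800, 300, 8000) == hd_copy_step(1600, 300, 8000) == 300
# A nearly full VRAM backlog caps the copy regardless of bandwidth:
assert hd_copy_step(1600, 2000, 120) == 120
```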
3. GPU Compute
Consumes VRAM-resident bytes. Slow compute, or a large backlog capacity, holds tensors in VRAM longer and pushes usage toward OOM.
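When transfer outpaces compute, the backlog grows linearly and you can estimate the time to OOM directly. A sketch under that assumption (all figures hypothetical; a real run also carries weights, activations, and allocator overhead in VRAM):

```python
import math

def steps_to_oom(copy_mbps, compute_mbps, vram_cap_mb, vram_used_mb=0.0):
    """Ticks until the VRAM backlog overflows, or None if it never grows."""
    growth = copy_mbps - compute_mbps   # net backlog growth per tick
    if growth <= 0:
        return None                     # compute keeps up: no backlog-driven OOM
    return math.ceil((vram_cap_mb - vram_used_mb) / growth)

# Hypothetical numbers: 800 MB/s in, 250 MB/s consumed, 8 GB of headroom.
ticks = steps_to_oom(800, 250, 8000)
```

The fix is therefore either to speed up compute or to shrink the backlog capacity so the transfer stage blocks instead of overflowing VRAM.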
Current diagnosis
[Live panel: loader/transfer/compute packet counts; produced, transferred, and consumed throughput; leading bottleneck]