Add PyTorch DataLoader Evaluator plugin
- Introduces a lightweight diagnostic tool for identifying data loading bottlenecks in PyTorch training pipelines.
- Adds a Loader Evaluator to the DALI PyTorch plugin, together with a Jupyter notebook tutorial, a documentation page, and tests.
- The LoaderEvaluator class wraps a PyTorch DataLoader with performance monitoring. It supports two operation modes: 'log' (normal iteration with metrics) and 'replay' (cached batches for ideal-performance simulation).
- A PerformanceMetrics class provides detailed performance tracking and bottleneck analysis.
- In-memory batch caching in replay mode simulates ideal data loading.
- Comprehensive test suite and documentation with an example notebook.
- The tool helps users compare real vs. ideal data loading performance and identify optimization opportunities.
Authored-by: Albert Wolant [email protected]
Category:
New feature (non-breaking change which adds functionality)
Description:
- Introduces a lightweight diagnostic tool for identifying data loading bottlenecks in PyTorch training pipelines.
- Adds a Loader Evaluator to the DALI PyTorch plugin, together with a Jupyter notebook tutorial, a documentation page, and tests.
- The LoaderEvaluator class wraps a PyTorch DataLoader with performance monitoring. It supports two operation modes: 'log' (normal iteration with metrics) and 'replay' (cached batches for ideal-performance simulation).
- A PerformanceMetrics class provides detailed performance tracking and bottleneck analysis.
- In-memory batch caching in replay mode simulates ideal data loading.
- Comprehensive test suite and documentation with an example notebook.
- The tool helps users compare real vs. ideal data loading performance and identify optimization opportunities.
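As a rough illustration of the log/replay flow described above (a hedged sketch, not the actual plugin API: the class name, constructor arguments, and metric keys below are assumptions, and a plain Python iterable stands in for a real torch.utils.data.DataLoader):

```python
import time
from itertools import islice


class TinyLoaderEvaluator:
    """Illustrative stand-in for the evaluator described above.

    'log' mode iterates the wrapped loader and records per-batch times;
    'replay' mode caches a few batches up front and serves them from
    memory, approximating zero-overhead data loading.
    """

    def __init__(self, dataloader, mode="log", cache_size=4):
        self.mode = mode
        self.dataloader = dataloader
        self.batch_times = []
        if mode == "replay":
            # Cache the first cache_size batches in memory.
            self.cache = list(islice(iter(dataloader), cache_size))

    def __iter__(self):
        source = self.cache if self.mode == "replay" else self.dataloader
        start = time.perf_counter()
        for batch in source:
            self.batch_times.append(time.perf_counter() - start)
            yield batch
            start = time.perf_counter()

    def get_metrics(self):
        total = sum(self.batch_times)
        return {
            "total_time": total,
            "avg_batch_time": total / max(len(self.batch_times), 1),
        }


# Any iterable of batches can play the DataLoader role in this sketch.
fake_loader = [[i, i + 1] for i in range(8)]
evaluator = TinyLoaderEvaluator(fake_loader, mode="log")
batches = list(evaluator)
metrics = evaluator.get_metrics()
```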
Additional information:
Affected modules and functionalities:
- New module in the PyTorch plugin
- New example
- New test for it
- New documentation page describing the overall idea
Key points relevant for the review:
- overall idea and flow
Tests:
- [ ] Existing tests apply
- [x] New tests added
  - [x] Python tests
    - test_pytorch_loader_evaluator.py
  - [ ] GTests
  - [ ] Benchmark
  - [ ] Other
- [ ] N/A
Checklist
Documentation
- [ ] Existing documentation applies
- [x] Documentation updated
  - [ ] Docstring
  - [ ] Doxygen
  - [x] RST
    - pytorch_data_loader_evaluator.rst
  - [x] Jupyter
    - pytorch_data_loader_evaluator.ipynb
  - [ ] Other
- [ ] N/A
DALI team only
Requirements
- [ ] Implements new requirements
- [ ] Affects existing requirements
- [x] N/A
REQ IDs: N/A
JIRA TASK: DALI-4299
Greptile Overview
Greptile Summary
This PR adds a new LoaderEvaluator diagnostic tool to the PyTorch plugin for identifying data loading bottlenecks in training pipelines.
Key Changes:
- New LoaderEvaluator class that wraps PyTorch DataLoader with two modes:
  - log mode: Normal iteration with performance metrics collection
  - replay mode: Caches batches in memory and replays them to simulate ideal (zero-overhead) data loading
- Comprehensive test suite with edge case coverage
- Documentation including RST pages and a Jupyter notebook tutorial demonstrating the bottleneck detection workflow
How It Works: The tool allows users to compare real data loading performance against ideal performance by caching a small number of batches and replaying them. If replay mode is significantly faster, it indicates a data loading bottleneck that could benefit from optimization (e.g., more workers, faster storage, or DALI).
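The decision this workflow supports boils down to a ratio of the two measured epoch times; a minimal sketch of the comparison (the function name and the 1.25 threshold are arbitrary illustrative choices, not part of the plugin):

```python
def data_loading_verdict(log_time_s, replay_time_s, threshold=1.25):
    """Compare real ('log') vs. ideal ('replay') epoch wall time.

    Returns (is_bottleneck, speedup_available): if the real loader is
    more than `threshold` times slower than replaying cached batches,
    data loading is likely the bottleneck.
    """
    speedup = log_time_s / replay_time_s
    return speedup > threshold, speedup


# A 30 s epoch that drops to 12 s with replayed batches points at the
# input pipeline rather than the model.
bottleneck, speedup = data_loading_verdict(30.0, 12.0)
```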
Integration:
- Cleanly integrates into the existing nvidia.dali.plugin.pytorch namespace
- Test added to the PyTorch test suite in qa/TL0_python-self-test-core/test_body.sh
Confidence Score: 4/5
- This PR is safe to merge - it adds a new optional diagnostic tool with no impact on existing functionality.
- Score of 4 reflects well-tested new functionality with comprehensive documentation. The implementation is straightforward and follows existing patterns in the codebase. Minor deduction because replay mode assumes DataLoader supports len() which may not work with IterableDataset.
- loader.py - consider documenting the len() requirement for replay mode
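The len() caveat is easy to reproduce without torch: an IterableDataset-backed DataLoader behaves like a generator, which raises TypeError on len(), so replay-mode code that needs a batch count should guard for it. A hedged sketch (`can_replay` is a hypothetical helper, not part of the plugin):

```python
def can_replay(loader):
    """Replay mode needs a known number of batches; map-style loaders
    expose __len__, iterable-style ones may not."""
    return hasattr(loader, "__len__")


map_style = [0, 1, 2]                   # like a map-style Dataset: sized
iterable_style = (x for x in range(3))  # like an IterableDataset: unsized

caught = None
try:
    len(iterable_style)  # generators do not implement __len__
except TypeError as exc:
    caught = exc
```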
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py | 4/5 | Core LoaderEvaluator implementation with two modes (log/replay). Well-structured with proper error handling. Minor observation: replay mode may fail if DataLoader doesn't support len() (e.g., IterableDataset). |
| dali/test/python/test_pytorch_loader_evaluator.py | 5/5 | Comprehensive test suite covering basic functionality, modes, metrics, edge cases, and error conditions. Good coverage. |
| docs/examples/frameworks/pytorch/loader_evaluator/pytorch_data_loader_evaluator.ipynb | 5/5 | Well-written tutorial notebook demonstrating bottleneck detection workflow with clear explanations and practical example. |
| docs/plugins/pytorch_data_loader_evaluator.rst | 5/5 | Comprehensive documentation explaining the tool's purpose, technical approach, and comparison with alternatives (nsys, PyTorch Profiler). |
Sequence Diagram
```mermaid
sequenceDiagram
    participant User as User Code
    participant LE as LoaderEvaluator
    participant DL as PyTorch DataLoader
    participant Cache as Batch Cache

    Note over User,Cache: Log Mode (Baseline Performance)
    User->>LE: for batch in loader (mode="log")
    LE->>DL: iter(dataloader)
    loop Each Batch
        LE->>DL: next()
        DL-->>LE: batch
        LE->>LE: Record batch_time
        LE-->>User: yield batch
    end
    User->>LE: get_metrics()
    LE-->>User: {total_time, avg_batch_time, ...}

    Note over User,Cache: Replay Mode (Ideal Performance)
    User->>LE: LoaderEvaluator(dataloader, mode="replay")
    LE->>DL: iter(dataloader) [during construction]
    loop Cache Batches
        DL-->>LE: batch
        LE->>Cache: append(batch)
    end
    User->>LE: for batch in loader
    loop Each Batch (from cache)
        LE->>Cache: get cached batch[i % cache_size]
        Cache-->>LE: batch
        LE->>LE: Record batch_time
        LE-->>User: yield batch
    end
    User->>LE: get_metrics()
    LE-->>User: {total_time, avg_batch_time, ...}
```
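The cache-cycling step in the replay branch of the diagram (`batch[i % cache_size]`) is plain modulo indexing; a self-contained sketch:

```python
def replay_batches(cache, num_batches):
    """Serve num_batches items by cycling over a small in-memory cache,
    mirroring the batch[i % cache_size] step in the diagram."""
    return [cache[i % len(cache)] for i in range(num_batches)]


cache = ["b0", "b1", "b2"]
epoch = replay_batches(cache, 7)  # cycles b0 b1 b2 b0 b1 b2 b0
```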
@greptileai please review again.