Add PyTorch DataLoader Evaluator plugin

Open JanuszL opened this issue 4 weeks ago • 28 comments

Authored-by: Albert Wolant [email protected]

Category:

New feature (non-breaking change which adds functionality)

Description:

  • Introduces a lightweight diagnostic tool for identifying data loading bottlenecks in PyTorch training pipelines.
  • Adds the Loader Evaluator to the DALI PyTorch plugin, together with a Jupyter notebook tutorial, a documentation page, and tests.
  • The LoaderEvaluator class wraps a PyTorch DataLoader with performance monitoring.
  • Two operation modes: 'log' (normal iteration with metrics) and 'replay' (cached batches for ideal performance simulation).
  • PerformanceMetrics class for detailed performance tracking and bottleneck analysis.
  • In-memory batch caching for replay mode to simulate ideal data loading.
  • Comprehensive test suite and documentation with an example notebook.
  • The tool helps users compare real vs. ideal data loading performance and identify optimization opportunities.

Additional information:

Affected modules and functionalities:

  • new module in the PyTorch plugin
  • new example
  • new test for it
  • new documentation page describing the overall idea

Key points relevant for the review:

  • overall idea and flow

Tests:

  • [ ] Existing tests apply
  • [x] New tests added
    • [x] Python tests
      • test_pytorch_loader_evaluator.py
    • [ ] GTests
    • [ ] Benchmark
    • [ ] Other
  • [ ] N/A

Checklist

Documentation

  • [ ] Existing documentation applies
  • [x] Documentation updated
    • [ ] Docstring
    • [ ] Doxygen
    • [x] RST
      • pytorch_data_loader_evaluator.rst
    • [x] Jupyter
      • pytorch_data_loader_evaluator.ipynb
    • [ ] Other
  • [ ] N/A

DALI team only

Requirements

  • [ ] Implements new requirements
  • [ ] Affects existing requirements
  • [x] N/A

REQ IDs: N/A

JIRA TASK: DALI-4299

JanuszL avatar Dec 05 '25 10:12 JanuszL

CI MESSAGE: [39670512]: BUILD STARTED

dali-automaton avatar Dec 05 '25 10:12 dali-automaton

!build

JanuszL avatar Dec 05 '25 10:12 JanuszL

CI MESSAGE: [39670654]: BUILD STARTED

dali-automaton avatar Dec 05 '25 10:12 dali-automaton

Greptile Overview

Greptile Summary

This PR adds a new LoaderEvaluator diagnostic tool to the PyTorch plugin for identifying data loading bottlenecks in training pipelines.

Key Changes:

  • New LoaderEvaluator class that wraps PyTorch DataLoader with two modes:
    • log mode: Normal iteration with performance metrics collection
    • replay mode: Caches batches in memory and replays them to simulate ideal (zero-overhead) data loading
  • Comprehensive test suite with edge case coverage
  • Documentation including RST pages and a Jupyter notebook tutorial demonstrating the bottleneck detection workflow

How It Works: The tool allows users to compare real data loading performance against ideal performance by caching a small number of batches and replaying them. If replay mode is significantly faster, it indicates a data loading bottleneck that could benefit from optimization (e.g., more workers, faster storage, or DALI).
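
The comparison described above can be sketched without DALI or PyTorch. This is a minimal illustration, not the PR's implementation: `slow_loader`, `replay`, `time_epoch`, and `train_step` are hypothetical names, and the sleep stands in for real decoding or augmentation cost.

```python
import time
from itertools import islice


def slow_loader(num_batches, delay_s=0.01):
    """Stand-in for a DataLoader whose batches are expensive to produce."""
    for i in range(num_batches):
        time.sleep(delay_s)  # simulated decoding / augmentation cost
        yield [i] * 4  # a fake "batch"


def replay(cached, num_batches):
    """Yield num_batches batches by cycling over a small in-memory cache."""
    for i in range(num_batches):
        yield cached[i % len(cached)]


def time_epoch(batches, step):
    """Run step() on every batch and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for batch in batches:
        step(batch)
    return time.perf_counter() - start


def train_step(batch):
    """Placeholder for the real forward/backward pass."""
    return sum(batch)


num_batches = 20
# Cache a few batches once; this cost is paid outside the timed region.
cache = list(islice(slow_loader(num_batches), 4))

real_s = time_epoch(slow_loader(num_batches), train_step)
ideal_s = time_epoch(replay(cache, num_batches), train_step)
print(f"real: {real_s:.3f}s  ideal: {ideal_s:.3f}s")
```

A large gap between the two timings suggests the loop is waiting on data; similar timings suggest the training step itself dominates.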

Integration:

  • Cleanly integrates into the existing nvidia.dali.plugin.pytorch namespace
  • Test added to the pytorch test suite in qa/TL0_python-self-test-core/test_body.sh

Confidence Score: 4/5

  • This PR is safe to merge - it adds a new optional diagnostic tool with no impact on existing functionality.
  • Score of 4 reflects well-tested new functionality with comprehensive documentation. The implementation is straightforward and follows existing patterns in the codebase. Minor deduction because replay mode assumes the DataLoader supports len(), which may not hold for an IterableDataset.
  • loader.py - consider documenting the len() requirement for replay mode
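
One way to guard against the len() concern is sketched below, under the assumption that an unsized loader raises TypeError on len(), as generators and other objects without `__len__` do. `epoch_length` and its `fallback` parameter are hypothetical and not part of the PR.

```python
def epoch_length(loader, fallback):
    """Return len(loader) when it is defined, otherwise an explicit fallback."""
    try:
        return len(loader)
    except TypeError:  # raised by objects without __len__, e.g. generators
        return fallback


def stream():
    """An iterable-style source with no known length."""
    yield from range(3)


sized = [10, 20, 30, 40]
print(epoch_length(sized, fallback=100))     # 4
print(epoch_length(stream(), fallback=100))  # 100
```

Requiring the caller to pass the number of batches explicitly is another option; it avoids silently picking a length for streaming datasets.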

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py | 4/5 | Core LoaderEvaluator implementation with two modes (log/replay). Well-structured with proper error handling. Minor observation: replay mode may fail if DataLoader doesn't support len() (e.g., IterableDataset). |
| dali/test/python/test_pytorch_loader_evaluator.py | 5/5 | Comprehensive test suite covering basic functionality, modes, metrics, edge cases, and error conditions. Good coverage. |
| docs/examples/frameworks/pytorch/loader_evaluator/pytorch_data_loader_evaluator.ipynb | 5/5 | Well-written tutorial notebook demonstrating bottleneck detection workflow with clear explanations and practical example. |
| docs/plugins/pytorch_data_loader_evaluator.rst | 5/5 | Comprehensive documentation explaining the tool's purpose, technical approach, and comparison with alternatives (nsys, PyTorch Profiler). |

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant LE as LoaderEvaluator
    participant DL as PyTorch DataLoader
    participant Cache as Batch Cache

    Note over User,Cache: Log Mode (Baseline Performance)
    User->>LE: for batch in loader (mode="log")
    LE->>DL: iter(dataloader)
    loop Each Batch
        LE->>DL: next()
        DL-->>LE: batch
        LE->>LE: Record batch_time
        LE-->>User: yield batch
    end
    User->>LE: get_metrics()
    LE-->>User: {total_time, avg_batch_time, ...}

    Note over User,Cache: Replay Mode (Ideal Performance)
    User->>LE: LoaderEvaluator(dataloader, mode="replay")
    LE->>DL: iter(dataloader) [during construction]
    loop Cache Batches
        DL-->>LE: batch
        LE->>Cache: append(batch)
    end
    User->>LE: for batch in loader
    loop Each Batch (from cache)
        LE->>Cache: get cached batch[i % cache_size]
        Cache-->>LE: batch
        LE->>LE: Record batch_time
        LE-->>User: yield batch
    end
    User->>LE: get_metrics()
    LE-->>User: {total_time, avg_batch_time, ...}
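
The log-mode half of the diagram can be approximated with a small wrapper that times each `next()` call. This is a sketch under stated assumptions, not the PR's code: `TimedLoader` is a hypothetical class, the `total_time` and `avg_batch_time` keys follow the diagram, and `num_batches` is an added assumption.

```python
import time


class TimedLoader:
    """Hypothetical wrapper that times how long each batch takes to fetch."""

    def __init__(self, loader):
        self.loader = loader
        self.batch_times = []

    def __iter__(self):
        self.batch_times = []  # reset metrics at the start of every pass
        it = iter(self.loader)
        while True:
            start = time.perf_counter()
            try:
                batch = next(it)
            except StopIteration:
                return  # inner loader exhausted; end this generator
            # Only the fetch is timed, not the consumer's work on the batch.
            self.batch_times.append(time.perf_counter() - start)
            yield batch

    def get_metrics(self):
        total = sum(self.batch_times)
        n = len(self.batch_times)
        return {
            "num_batches": n,
            "total_time": total,
            "avg_batch_time": total / n if n else 0.0,
        }


timed = TimedLoader([[1, 2], [3, 4], [5, 6]])
seen = [batch for batch in timed]
metrics = timed.get_metrics()
print(metrics["num_batches"], metrics["avg_batch_time"])
```

Timing only the fetch keeps the measurement independent of how long the training step takes, which is what makes it comparable with a replay run.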

greptile-apps[bot] avatar Dec 05 '25 10:12 greptile-apps[bot]

@greptileai please review again.

JanuszL avatar Dec 05 '25 12:12 JanuszL

!build

JanuszL avatar Dec 05 '25 12:12 JanuszL

CI MESSAGE: [39675479]: BUILD STARTED

dali-automaton avatar Dec 05 '25 12:12 dali-automaton

@greptileai please review again.

JanuszL avatar Dec 05 '25 12:12 JanuszL

@greptileai please review again.

JanuszL avatar Dec 05 '25 13:12 JanuszL

!build

JanuszL avatar Dec 05 '25 13:12 JanuszL

CI MESSAGE: [39676718]: BUILD STARTED

dali-automaton avatar Dec 05 '25 13:12 dali-automaton

CI MESSAGE: [39670654]: BUILD FAILED

dali-automaton avatar Dec 05 '25 17:12 dali-automaton

CI MESSAGE: [39676718]: BUILD FAILED

dali-automaton avatar Dec 05 '25 17:12 dali-automaton

!build

JanuszL avatar Dec 05 '25 18:12 JanuszL

CI MESSAGE: [39693133]: BUILD STARTED

dali-automaton avatar Dec 05 '25 18:12 dali-automaton

CI MESSAGE: [39693133]: BUILD FAILED

dali-automaton avatar Dec 05 '25 21:12 dali-automaton

CI MESSAGE: [39693133]: BUILD PASSED

dali-automaton avatar Dec 07 '25 11:12 dali-automaton

!build

JanuszL avatar Dec 08 '25 05:12 JanuszL

CI MESSAGE: [39788983]: BUILD STARTED

dali-automaton avatar Dec 08 '25 05:12 dali-automaton

CI MESSAGE: [39788983]: BUILD FAILED

dali-automaton avatar Dec 08 '25 14:12 dali-automaton

CI MESSAGE: [39788983]: BUILD PASSED

dali-automaton avatar Dec 08 '25 14:12 dali-automaton

!build

JanuszL avatar Dec 09 '25 17:12 JanuszL

CI MESSAGE: [39900875]: BUILD STARTED

dali-automaton avatar Dec 09 '25 17:12 dali-automaton

!build

JanuszL avatar Dec 09 '25 17:12 JanuszL

CI MESSAGE: [39901633]: BUILD STARTED

dali-automaton avatar Dec 09 '25 18:12 dali-automaton

CI MESSAGE: [39901633]: BUILD FAILED

dali-automaton avatar Dec 09 '25 19:12 dali-automaton

CI MESSAGE: [39901633]: BUILD PASSED

dali-automaton avatar Dec 09 '25 19:12 dali-automaton