
Add memory-efficient InfiniteHashSampler and infinite sampler tests

Open gertln opened this issue 6 months ago • 0 comments

PhysicsNeMo Pull Request

Description

Introduces InfiniteHashSampler, a new memory-efficient infinite sampler for very large datasets (a billion or more samples). It uses hash-based randomization instead of storing full index arrays. Tests for both infinite samplers have also been added.

  • Hash-Based Randomization: Deterministic pseudo-random sampling using an efficient hash function
  • Distributed Training Support: Full compatibility with DistributedDataParallel (DDP)
  • Billion-Scale Ready: Tested with datasets up to 10 billion samples
  • Sequential Fallback: Option to disable randomization for sequential access
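
To illustrate the idea behind the features listed above, here is a minimal sketch of a hash-based infinite sampler. It is not the InfiniteHashSampler code from this pull request: the class name `HashSamplerSketch`, its parameters (`dataset_size`, `rank`, `world_size`, `seed`, `shuffle`), and the choice of the SplitMix64 mixing function are all assumptions made for illustration. The sketch hashes a running position counter to an index, so memory use stays constant regardless of dataset size. Note that hashing modulo the dataset size samples with replacement; it does not produce a permutation of the dataset.

```python
import itertools

MASK64 = (1 << 64) - 1


def splitmix64(x: int) -> int:
    """SplitMix64 mixing step: maps an integer to a well-scrambled 64-bit value."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    z = x
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


class HashSamplerSketch:
    """Hypothetical infinite sampler that never materializes an index array.

    Each rank walks the global positions rank, rank + world_size, ... and
    maps every position to a dataset index, either by hashing (shuffle=True)
    or by plain modulo (shuffle=False, the sequential fallback).
    """

    def __init__(self, dataset_size, rank=0, world_size=1, seed=0, shuffle=True):
        if dataset_size <= 0:
            raise ValueError("dataset_size must be positive")
        if not 0 <= rank < world_size:
            raise ValueError("rank must satisfy 0 <= rank < world_size")
        self.dataset_size = dataset_size
        self.rank = rank
        self.world_size = world_size
        self.seed = seed
        self.shuffle = shuffle

    def __iter__(self):
        position = self.rank
        seed_mix = splitmix64(self.seed)  # decorrelate streams for different seeds
        while True:
            if self.shuffle:
                # Deterministic pseudo-random index: same seed and position
                # always give the same index, on every rank and every run.
                yield splitmix64(position ^ seed_mix) % self.dataset_size
            else:
                yield position % self.dataset_size
            position += self.world_size  # ranks consume disjoint positions
```

Because the iterator never terminates, a consumer takes as many indices as it needs, for example `list(itertools.islice(HashSamplerSketch(10**10, rank=0, world_size=2, seed=7), 4))`. In a real DDP setup, `rank` and `world_size` would come from the process group rather than being passed by hand.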

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.
  • [x] The CHANGELOG.md is up to date with these changes.
  • [ ] An issue is linked to this pull request.

Dependencies

gertln avatar Jun 10 '25 14:06 gertln