Improve NeMo Curator Experience for PyTorch Models (with crossfit)
Is your feature request related to a problem? Please describe.
Based on user feedback, we need to fix the following to improve the user experience:
- [x] Enable PyTorch to share same memory pool as RMM via cli #1392
  PR in dask-cuda merged; will follow up for Curator in the next release.
- [x] Fix GPU-id logging of crossfit
- [x] Enable memory estimation of crossfit+PyTorch when RMM backend is used
- [ ] Possible Memory Estimation Issue Leading to OOMs and Restarts #72
- [ ] Semantic dedup (uses Crossfit / PyTorch) gets stuck with UCX #283
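
For context on the first item (PyTorch sharing a memory pool with RMM), here is a minimal sketch of doing this by hand in a single process, assuming `rmm` and `torch` are installed with CUDA support. It relies on RMM's documented PyTorch allocator hook; the helper name `use_rmm_pool_for_torch` is hypothetical, and this is not necessarily how dask-cuda or Curator wire it up for workers.

```python
def use_rmm_pool_for_torch() -> bool:
    """Try to route PyTorch CUDA allocations through an RMM pool.

    Hypothetical helper for illustration. Returns True if the allocator
    was switched, False if the environment does not allow it.
    """
    try:
        import rmm
        import torch
        from rmm.allocators.torch import rmm_torch_allocator
    except Exception:
        # torch or rmm is missing, or could not load (e.g. no CUDA driver).
        return False

    if not torch.cuda.is_available():
        return False

    try:
        # Create an RMM pool memory resource on the current device.
        rmm.reinitialize(pool_allocator=True)
        # This must run before PyTorch makes its first CUDA allocation;
        # PyTorch raises an error if its allocator is already initialized.
        torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
    except Exception:
        return False
    return True


if __name__ == "__main__":
    print("PyTorch uses RMM pool:", use_rmm_pool_for_torch())
```

With both libraries drawing from one pool, memory held by one is visible to the other, which is the behavior the checklist item asks to expose via the CLI.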
Improve performance by adding model compilation:
- [ ] https://github.com/rapidsai/crossfit/issues/90
At risk because not all items above will be completed in this sprint. Vibhu to break this into two issues.
All subtasks are completed.