Stas Bekman
This issue is being addressed in https://github.com/microsoft/DeepSpeed/issues/2811 and https://github.com/microsoft/DeepSpeed/issues/2812, which I think should resolve the leak as well. The DeepSpeed team is actively working on solving both.
Actually @tohtana has just created a PR that is supposed to fix both issues: https://github.com/microsoft/DeepSpeed/pull/2989 I will be able to try it probably tomorrow, but please go ahead and try...
@dennisbakhuis, please try with this PR https://github.com/microsoft/DeepSpeed/pull/2989 - I tested it and your repro now works. You will also need to add the following at the top level of `ds_config.json` (this is an unrelated change) ```...
Thank you for testing https://github.com/microsoft/DeepSpeed/pull/2989, @dumpmemory - sorry to hear it didn't resolve the leak - perhaps file a new issue in DS, as the one I posted I couldn't...
https://github.com/microsoft/DeepSpeed/pull/2989 has been merged, so closing this issue. To verify the solution, please use deepspeed@master until the next release (0.9.1?) is made.
Thank you very much for the recommendation, @boweiliu. I have it on the list, but haven't had a chance to read it yet. Your list sounds fitting the content of...
OK, so this is actually not just an advisory warning that is relevant to some; this is a problem that users need to be made aware of by all means....
Yes, that's what HF Trainer does and if I remember correctly Megatron-LM does as well. But this only works well if you have a simple DataSampler - ideally already preprocessed...
@lhoestq, I updated the OP and was able to bisect which package and version led to the breakage.
Thanks a lot, @lhoestq @williamberrios - could you please test this asap, and if everything works they can make a new release - thank you!