
Maybe use reference stash to replace record stream to reduce mem peak

Open junjzhang opened this issue 2 months ago • 1 comments

Because of how the CUDACachingAllocator behaves, record stream leads to a late memory free, which has a significant impact on peak memory. (See FSDP1's issue caused by recordStream.) In PyTorch 2.8+, c10d has removed all recordStream calls in collective communication. Thus, I wonder if there are plans to remove record stream and use a reference stash to handle the multi-stream scenario?
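To make the idea concrete, here is a minimal sketch of what a reference stash could look like. It is not DeepEP's or PyTorch's implementation: `ReferenceStash` and `FakeEvent` are hypothetical names, and the stash only assumes an event object exposing `query()`, as `torch.cuda.Event` does. It also assumes entries complete in submission order, as they would on a single communication stream. The point is that the caller holds the tensor references itself and drops them once the guarding event has completed, instead of asking the allocator to track cross-stream usage.

```python
import weakref
from collections import deque


class ReferenceStash:
    """Hypothetical helper: keeps objects alive until their guarding event completes."""

    def __init__(self):
        # (event, refs) pairs in submission order.
        self._pending = deque()

    def stash(self, event, *refs):
        """Hold `refs` until `event.query()` returns True."""
        self._pending.append((event, refs))

    def release_completed(self):
        """Drop entries whose event has completed; return how many were released."""
        released = 0
        # Assumes FIFO completion, e.g. all events recorded on one stream.
        while self._pending and self._pending[0][0].query():
            self._pending.popleft()
            released += 1
        return released

    def __len__(self):
        return len(self._pending)


class FakeEvent:
    """Stand-in for a GPU event so the sketch runs without CUDA."""

    def __init__(self):
        self.done = False

    def query(self):
        return self.done


class FakeBuffer:
    """Stand-in for a tensor whose memory we want freed promptly."""


# Demo: the buffer stays alive while the event is pending,
# and becomes collectable as soon as the event completes.
demo_stash = ReferenceStash()
demo_event = FakeEvent()
demo_buffer = FakeBuffer()
demo_alive = weakref.ref(demo_buffer)

demo_stash.stash(demo_event, demo_buffer)
del demo_buffer                      # caller drops its own reference
pending_before = len(demo_stash)     # stash still holds the buffer

demo_event.done = True               # the "communication kernel" finished
released_count = demo_stash.release_completed()
```

With real tensors, the event would be recorded on the communication stream right after the kernel launch, and `release_completed()` would be called at a convenient point on the host, for example before the next dispatch.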

junjzhang, Oct 15 '25 08:10

https://github.com/deepseek-ai/DeepEP/pull/456 is a tested PR for this; it is open to further discussion.

junjzhang, Oct 15 '25 15:10