DeepEP
Consider using a reference stash instead of record_stream to reduce peak memory
Because of how PyTorch's CUDACachingAllocator behaves, record_stream leads to memory being freed late, which has a significant impact on peak memory. (See FSDP1's issue caused by recordStream.) In PyTorch 2.8+, c10d has removed all recordStream calls from collective communication. Are there plans to remove record_stream here as well and use a reference stash to handle the multi-stream scenario?
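To make the proposal concrete, here is a minimal sketch of the reference-stash idea. It is not DeepEP's or c10d's actual code: `StashedHandle` and `launch_on_comm_stream` are hypothetical names, the `x * 2` kernel is only a stand-in for a dispatch/combine call, and the sketch assumes the stashed tensors were allocated on the stream that later calls `wait`.

```python
class StashedHandle:
    """Hypothetical handle that keeps tensors alive until the consumer waits.

    `event` and `consumer_stream` only need `torch.cuda.Event` /
    `torch.cuda.Stream`-like interfaces, so the class itself has no
    dependency on torch.
    """

    def __init__(self, event, refs):
        self._event = event
        # Holding Python references keeps the caching allocator from
        # reusing these blocks while the side stream may still read them.
        self._refs = list(refs)

    @property
    def num_stashed(self):
        return len(self._refs)

    def wait(self, consumer_stream):
        # Order the consumer stream after the side-stream work (no CPU sync).
        consumer_stream.wait_event(self._event)
        # Drop the references now. Assumption: the tensors were allocated on
        # the consumer stream, so any reuse of their blocks is enqueued after
        # the wait above and cannot race with the side stream.
        self._refs.clear()


def launch_on_comm_stream(x, comm_stream):
    """Sketch of a call site; requires a CUDA build of PyTorch."""
    import torch  # imported lazily so StashedHandle works without torch

    # The side stream must see x fully written by the current stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        out = x * 2  # stand-in for a communication kernel
        done = torch.cuda.Event()
        done.record(comm_stream)
    # Instead of x.record_stream(comm_stream), stash a reference to x.
    return out, StashedHandle(done, [x])
```

A caller would invoke `handle.wait(torch.cuda.current_stream())` before consuming `out`. Compared with record_stream, the input's lifetime then ends at a deterministic point chosen by the caller rather than whenever the allocator later observes that the side stream has finished.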
A tested PR is available for further discussion: https://github.com/deepseek-ai/DeepEP/pull/456