TransformerEngine
Why is the result of context-parallel DotProductAttention influenced by the random seed?
Hi! When I replace the regular attention calculation with context-parallel DotProductAttention, I find that the results of DotProductAttention change with the random seed, and the outputs do not align exactly. How can I resolve this?
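For reference, here is a minimal pure-Python sketch of the kind of comparison I mean. It does not use TransformerEngine at all; `full_attention`, `chunked_attention`, and `max_abs_diff` are hypothetical helpers written only for illustration. It assumes that splitting the sequence into chunks and merging partial softmax results is a fair stand-in for how a context-parallel attention reorders the arithmetic, which is an assumption on my part and not a confirmed description of the library's implementation. The point is only that the two paths are compared with a tolerance for several seeds, since floating-point sums taken in a different order are not guaranteed to be bit-identical.

```python
import math
import random


def full_attention(q, keys, values):
    """Single-query softmax attention computed over the whole sequence at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


def chunked_attention(q, keys, values, num_chunks):
    """Same attention, merged chunk by chunk with a running max and running sum."""
    assert len(keys) % num_chunks == 0, "sequence must split evenly (illustrative only)"
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    chunk = len(keys) // num_chunks
    running_max, running_sum, acc = -math.inf, 0.0, [0.0] * dim
    for c in range(num_chunks):
        ks = keys[c * chunk:(c + 1) * chunk]
        vs = values[c * chunk:(c + 1) * chunk]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 for the first chunk
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, vs))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


def max_abs_diff(seed, seq_len=64, dim=8, num_chunks=4):
    """Largest elementwise gap between the two paths for inputs drawn from `seed`."""
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
    values = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
    ref = full_attention(q, keys, values)
    out = chunked_attention(q, keys, values, num_chunks)
    return max(abs(a - b) for a, b in zip(ref, out))


if __name__ == "__main__":
    for seed in range(3):
        print(seed, max_abs_diff(seed))
```

In this sketch the gap between the two paths depends on the inputs, and therefore on the seed, so I compare against a tolerance instead of expecting exact equality. What I would like to know is whether the mismatch I see with context-parallel DotProductAttention is of this kind or something larger.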