Multi-Node Sparse Training Error

Open gaow0007 opened this issue 2 years ago • 3 comments

Thanks for releasing Ok-Topk. It is interesting work, and I am developing some features based on this repo. Single-node training works for me. However, when I run Ok-Topk across 2 nodes with a total of 8 GPUs, I find that some values in all_indexes are negative.
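To make the symptom concrete, here is a minimal sketch of a check that flags such entries. It assumes all_indexes can be iterated as a flat sequence of integers and that valid indices lie in the range 0 to num_params - 1. `find_invalid_indexes` and `num_params` are hypothetical names used for illustration only; they are not part of the Ok-Topk code.

```python
def find_invalid_indexes(all_indexes, num_params):
    """Return (position, value) pairs for entries outside [0, num_params).

    Hypothetical helper for illustration; not part of Ok-Topk.
    """
    invalid = []
    for pos, idx in enumerate(all_indexes):
        value = int(idx)  # also accepts 0-d array or tensor elements
        if value < 0 or value >= num_params:
            invalid.append((pos, value))
    return invalid


# Made-up index buffer with one negative and one too-large entry.
bad = find_invalid_indexes([3, -1, 7, 12, 0], num_params=10)
print(bad)
```

Running a check like this on each rank right after the indices are gathered might help narrow down which rank and which position the negative values come from.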

Do you have any suggestions on how to debug this?

Thanks.

gaow0007, May 14 '22 08:05