Ok-Topk
Multi-Node Sparse Training Error
Thanks for releasing Ok-Topk. It is interesting work, and I am developing some functionality based on this repo. Single-node training works. However, when I run Ok-Topk across 2 nodes with a total of 8 GPUs, I find that some values in all_indexes are negative.
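To make the symptom concrete, here is a minimal sketch of the kind of check that exposes it. It is not code from the repo: check_indexes, num_elements, and rank are hypothetical names, and it assumes all_indexes can be iterated as a flat sequence of integers and that valid indexes lie in [0, num_elements).

```python
def check_indexes(all_indexes, num_elements, rank=0):
    """Return (position, value) pairs for indexes outside [0, num_elements).

    Hypothetical helper for illustration; not part of Ok-Topk.
    """
    bad = [
        (pos, int(idx))
        for pos, idx in enumerate(all_indexes)
        if int(idx) < 0 or int(idx) >= num_elements
    ]
    if bad:
        # Report per rank so the offending process can be identified.
        print(f"[rank {rank}] {len(bad)} invalid indexes, first few: {bad[:5]}")
    return bad
```

Calling this on every rank right after all_indexes is built should show whether the negative values appear on all ranks or only on some of them.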
Could you suggest how I might debug this?
Thanks.