
[bugfix] Fix error on empty sharders

Open zhuzilin opened this issue 3 years ago • 4 comments

When passing an empty sharder list to DMP, the sharder_map becomes an empty dict, so `not sharder_map` evaluates to True and the error path is taken. In our experience, it's useful to support passing an empty sharder list to DMP so we can compare performance against a non-sharded baseline.
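A minimal sketch of the failure mode described above (not the actual torchrec source; names such as `build_sharder_map` and `DummySharder` are illustrative). An empty sharder list produces an empty dict, the truthiness check on that dict fires, and the explicitly empty list is treated the same as "no sharders provided":

```python
from typing import Dict, List, Optional


class DummySharder:
    """Stand-in for a torchrec ModuleSharder; purely illustrative."""

    def module_type(self) -> str:
        return "EmbeddingBagCollection"


def build_sharder_map(sharders: List[DummySharder]) -> Dict[str, DummySharder]:
    return {s.module_type(): s for s in sharders}


def buggy_check(sharders: List[DummySharder]) -> str:
    sharder_map = build_sharder_map(sharders)
    if not sharder_map:  # empty dict is falsy -> wrongly treated as "missing"
        return "error: no sharders"
    return "ok"


def fixed_check(sharders: Optional[List[DummySharder]]) -> str:
    # Only fall back when sharders was not provided at all; an explicitly
    # empty list means "shard nothing, just wrap the model with DDP".
    if sharders is None:
        return "error: no sharders"
    build_sharder_map(sharders)
    return "ok"


print(buggy_check([]))  # "error: no sharders" -- the behavior reported here
print(fixed_check([]))  # "ok" -- empty list accepted
```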

zhuzilin avatar May 24 '22 07:05 zhuzilin

Why do you want to pass an empty sharder?

xing-liu avatar May 24 '22 07:05 xing-liu

@xing-liu I found that when training with DMP on 1 GPU, the model diverges within a few steps, while the non-DMP version converges fine. Allowing an empty sharder list would help me determine whether the problem is in the DDP part or in the sharder (see the sketch below).
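A hypothetical sketch of that comparison, assuming the proposed fix so that `DistributedModelParallel` accepts `sharders=[]`. With an empty sharder list nothing is sharded and DMP only wraps the dense modules with DDP, which isolates the sharder as a variable; the model, backend, and process-group setup below are placeholders:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torchrec.distributed.model_parallel import DistributedModelParallel

# Single-process setup for illustration only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1)).to(device)

# sharders=[]: shard nothing; DMP should only apply DDP to the dense parts.
# Train this and the plain (no-DMP) model on the same data and compare curves.
dmp_model = DistributedModelParallel(model, device=device, sharders=[])
```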

zhuzilin avatar May 24 '22 10:05 zhuzilin

Can you share your model code?

xing-liu avatar May 24 '22 15:05 xing-liu

@xing-liu I'm afraid I can't share the model... Also, I fixed the divergence problem yesterday; it turned out to be a misuse of torchrec on my side...

zhuzilin avatar May 25 '22 03:05 zhuzilin