
[bugfix] Fix error on empty sharders

Open zhuzilin opened this issue 3 years ago • 4 comments

When passing an empty sharder list to DMP, the sharder_map becomes an empty dict, so `not sharder_map` evaluates to True and the error path is taken. In our experience, it's useful to support passing an empty sharder list to DMP so we can compare performance against a non-sharded baseline.
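A minimal sketch of the failure mode described above (not the actual torchrec source; names such as `build_sharder_map` and `DummySharder` are illustrative). An empty sharder list produces an empty dict, the truthiness check on that dict fires, and the explicitly empty list is treated the same as "no sharders provided":

```python
from typing import Dict, List, Optional


class DummySharder:
    """Stand-in for a torchrec ModuleSharder; purely illustrative."""

    def module_type(self) -> str:
        return "EmbeddingBagCollection"


def build_sharder_map(sharders: List[DummySharder]) -> Dict[str, DummySharder]:
    return {s.module_type(): s for s in sharders}


def buggy_check(sharders: List[DummySharder]) -> str:
    sharder_map = build_sharder_map(sharders)
    if not sharder_map:  # empty dict is falsy -> wrongly treated as "missing"
        return "error: no sharders"
    return "ok"


def fixed_check(sharders: Optional[List[DummySharder]]) -> str:
    # Only fall back when sharders was not provided at all; an explicitly
    # empty list means "shard nothing, just wrap the model with DDP".
    if sharders is None:
        return "error: no sharders"
    build_sharder_map(sharders)
    return "ok"


print(buggy_check([]))  # "error: no sharders" -- the behavior reported here
print(fixed_check([]))  # "ok" -- empty list accepted
```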

zhuzilin avatar May 24 '22 07:05 zhuzilin

Why do you want to pass an empty sharder?

xing-liu avatar May 24 '22 07:05 xing-liu

@xing-liu I found that when training with DMP on 1 GPU, the model diverges within a few steps, while the non-DMP version converges fine. Allowing an empty sharder list would help me determine whether the problem is in the DDP part or in the sharder (see the sketch below).
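A hypothetical sketch of that comparison, assuming the proposed fix so that `DistributedModelParallel` accepts `sharders=[]`. With an empty sharder list nothing is sharded and DMP only wraps the dense modules with DDP, which isolates the sharder as a variable; the model, backend, and process-group setup below are placeholders:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torchrec.distributed.model_parallel import DistributedModelParallel

# Single-process setup for illustration only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1)).to(device)

# sharders=[]: shard nothing; DMP should only apply DDP to the dense parts.
# Train this and the plain (no-DMP) model on the same data and compare curves.
dmp_model = DistributedModelParallel(model, device=device, sharders=[])
```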

zhuzilin avatar May 24 '22 10:05 zhuzilin

Can you share your model code?

xing-liu avatar May 24 '22 15:05 xing-liu

@xing-liu I'm afraid I can't share the model... Also, I fixed the divergence problem yesterday; it turned out to be a misuse of torchrec on my side...

zhuzilin avatar May 25 '22 03:05 zhuzilin