Reza Yazdani

95 comments by Reza Yazdani

Hi @arthur-morgan-712, I am not sure what exactly is happening here, since the trace log says that it is trying to load the extension module...

We already have examples for running some transformer networks. For this argument, I think you can just add `local_rank` to your parser arguments, the same as [here](https://github.com/microsoft/DeepSpeedExamples/blob/20ea07a2a069696abec212e25476a9bf76aced70/bing_bert/utils.py#L51-L54).
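For readers who do not want to follow the link, a minimal sketch of adding a local_rank argument to an `argparse` parser might look like the following. The flag spelling `--local_rank`, the default of `-1`, the help text, and the `build_parser` helper are assumptions for illustration; the linked `utils.py` is the authoritative version.

```python
import argparse


def build_parser():
    """Build an argument parser that accepts a local rank (illustrative helper)."""
    parser = argparse.ArgumentParser(description="Training script (illustrative)")
    # Assumed spelling and default: distributed launchers commonly pass the
    # per-process GPU index to each worker through an argument like this one.
    parser.add_argument(
        "--local_rank",
        type=int,
        default=-1,
        help="local rank for distributed training on GPUs",
    )
    return parser


# Parse an explicit list here so the example does not depend on sys.argv.
args = build_parser().parse_args(["--local_rank", "0"])
print(args.local_rank)
```

When the script is not started by a launcher, the argument is simply absent and the parser falls back to its default.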

Thanks @stas00 for clarifying this : )

Hi @tomerip, thanks for bringing up this interesting issue. I will definitely look into this and fix it soon. Reza

Hey @tomerip, sorry for the long delay here. We have a deadline at the end of the week, and I will have more time for this issue next week. Hopefully,...

Hi @joehoover, I added this PR. Can you please try it and let me know if it works on your side? Thanks, Reza

Hi @stas00, thanks for tagging me here. I will definitely look into this and try to fix it soon. Best, Reza

Hi @asaparov, it's great to see your issue is solved. As @stas00 mentioned, the part regarding the new checkpoint loading is coming soon too. @stas00, thanks for the full details here...

Hi all, there are some new changes merged into DeepSpeed master. Would you mind trying that? I have tried with batch sizes 1 and 128 and both are working on my...

Hi @pohunghuang-nctu, sure, you need to pass `save_mp_checkpoint_path` to the `init_inference` method to save the TP-sharded checkpoints to the path you specify. You will see that after loading...
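As a rough sketch of the suggestion above, the keyword arguments could be collected like this before calling `deepspeed.init_inference`. Only `save_mp_checkpoint_path` comes from the comment; the `mp_size` and `replace_with_kernel_inject` settings, the helper names, and the default tensor-parallel degree of 2 are assumptions, and other arguments (such as `dtype`) are left out. Check the DeepSpeed version you use, since argument names have changed across releases.

```python
def build_init_inference_kwargs(save_dir, tp_size=2):
    """Collect keyword arguments for deepspeed.init_inference (illustrative helper)."""
    return {
        # Assumed: tensor-parallel degree used to shard the model.
        "mp_size": tp_size,
        # Assumed: kernel injection enabled for the sharded model.
        "replace_with_kernel_inject": True,
        # From the comment: where the TP-sharded checkpoints are written.
        "save_mp_checkpoint_path": str(save_dir),
    }


def shard_and_save(model, save_dir, tp_size=2):
    """Wrap a model for inference and save its sharded checkpoints (hypothetical wrapper)."""
    # Imported lazily so the helper above works without DeepSpeed installed.
    import deepspeed

    return deepspeed.init_inference(
        model, **build_init_inference_kwargs(save_dir, tp_size)
    )
```

Keeping the arguments in a plain dictionary makes it easy to inspect or log the settings before the model is wrapped.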