DeepSpeed
Allow accelerator to instantiate the device
When instantiating `torch.device` for HPU, it cannot be given an index annotation like "hpu:1", only the bare "hpu". Moving the device-instantiation logic into the accelerator solves this issue with a single-line change.
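To illustrate the idea behind this change, here is a minimal sketch (the function name and structure are illustrative, not DeepSpeed's actual accelerator API) of why device construction belongs in the accelerator: each backend knows whether it accepts an index annotation.

```python
def make_device_str(accelerator_name: str, index: int) -> str:
    """Build the string that would be passed to torch.device().

    Hypothetical helper for illustration: the per-accelerator logic
    decides whether an index suffix is valid for that backend.
    """
    if accelerator_name == "hpu":
        # torch.device("hpu:1") is rejected by the HPU backend;
        # only the bare "hpu" is accepted, so the index is dropped here.
        return "hpu"
    # CUDA and most other backends accept the "<type>:<index>" form.
    return f"{accelerator_name}:{index}"
```

With the logic in the accelerator, callers no longer need backend-specific branches: `make_device_str("cuda", 1)` yields `"cuda:1"` while `make_device_str("hpu", 1)` yields `"hpu"`.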
Let's merge #5275 first so the newly added unit tests run on this PR, then we can merge it?
@loadams Sure, thanks :)
Done now, so the tests should run on this PR, then we can merge it, thanks!
Halting this PR until Habana version 1.16.0 is released. The CI issue should be fixed there.
Thanks @nelyahu - when it is released, go ahead and update it in this PR. Also, can you share what in that release is needed to fix the failures here, since the tests appear to just hang now?
And just tag one of us when it is updated and we can get this merged.
Sure @loadams, I will update. I verified this patch on v1.16.0, since our dev process is on the next release branch, and I did not verify it on v1.14.0 (the release being tested here). Many bugs have been fixed since then, so I cannot say for sure what the root cause is.
Thanks @nelyahu - feel free to tag us when the new version is released and updated here, and we can merge this then, thanks!
@nelyahu - I tried 1.15.1 and it looks like the same errors, so we will wait for 1.16.
@loadams Yes.
@nelyahu - now that we have updated the hpu runner to 1.17, should we move ahead with merging this PR?
@nelyahu / @BacharL - looks like there is an error with this; could you take a look when you have time?
Yes, this test is disabled locally in our environment and fails for the same reason. I removed it from the Gaudi2 workflow file; once the issue is fixed, it will be re-added.