Allow accelerator to instantiate the device

Open nelyahu opened this issue 11 months ago • 7 comments

When instantiating torch.device for HPU, it cannot be given an indexed name such as "HPU:1", only the bare "HPU". Moving this logic into the accelerator solves the issue with a single-line change.
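To illustrate the idea, here is a minimal sketch and not the actual patch. It assumes the accelerator exposes a `device(device_index)` method that callers use instead of formatting a "name:index" string themselves. `Device`, `IndexedAccelerator`, `BareNameAccelerator`, and `pick_device` are hypothetical names invented for this example; `Device` is a stand-in for torch.device so the sketch runs without PyTorch or the Habana stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for torch.device (keeps the sketch dependency-free)."""
    type: str
    index: Optional[int] = None


class IndexedAccelerator:
    """Hypothetical backend whose devices accept an index."""
    name = "cuda"

    def device(self, device_index: Optional[int] = None) -> Device:
        return Device(self.name, device_index)


class BareNameAccelerator(IndexedAccelerator):
    """Hypothetical HPU-like backend that only accepts the bare device name."""
    name = "hpu"

    def device(self, device_index: Optional[int] = None) -> Device:
        # Assumption: the index is simply dropped for this backend,
        # so the special case lives here and not in every caller.
        return Device(self.name)


def pick_device(accelerator: IndexedAccelerator, local_rank: int) -> Device:
    # Callers delegate to the accelerator instead of building "name:index".
    return accelerator.device(local_rank)
```

With this shape, the calling code is identical for both backends, and only the accelerator decides whether the index is meaningful.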

nelyahu avatar Mar 11 '24 16:03 nelyahu

Will merge #5275 first, that way we can have the newly added unit tests run on this, then we can merge it?

loadams avatar Mar 27 '24 22:03 loadams

@loadams Sure, thanks :)

nelyahu avatar Mar 28 '24 07:03 nelyahu

Done now, so the tests should run on this PR, then we can merge it, thanks!

loadams avatar Mar 28 '24 17:03 loadams

Halting this PR until Habana version 1.16.0 is released. The CI issue should be fixed there.

nelyahu avatar Apr 03 '24 10:04 nelyahu

Thanks @nelyahu - when it is released, could you go ahead and update it in this PR? Also, I'm curious whether you can share what in that release fixes the failures here, since it appears the tests just hang now.

And just tag one of us when it is updated and we can get this merged.

loadams avatar Apr 03 '24 15:04 loadams

Sure @loadams, I will update. I verified this patch on v1.16.0, since our dev process runs on the next release branch, and I did not verify it on v1.14.0 (the release being tested here). Many bugs have been fixed since then, so I cannot say for sure what the root cause is.

nelyahu avatar Apr 04 '24 07:04 nelyahu

Thanks @nelyahu - feel free to tag us when the new version is released and this PR is updated so we can merge it then, thanks!

loadams avatar Apr 19 '24 15:04 loadams

@nelyahu - I tried 1.15.1 and it looks like the same errors, so we will wait for 1.16.

loadams avatar May 28 '24 18:05 loadams

@loadams Yes.

nelyahu avatar May 28 '24 19:05 nelyahu

@nelyahu - now that we have updated the hpu runner to 1.17, should we move ahead with merging this PR?

loadams avatar Aug 14 '24 17:08 loadams

@nelyahu/ @BacharL - looks like there is an error with this, could you take a look when you have time?

loadams avatar Aug 14 '24 21:08 loadams

Yes, this test is disabled locally in our environment and fails there for the same reason. I removed it from the gaudi2 workflow file; it will be re-added once fixed.

nelyahu avatar Aug 15 '24 07:08 nelyahu