DeepSpeed
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device()
Creating a Torch tensor with the parameter device=get_accelerator().current_device() can result in a crash when using an NPU.
This issue arises because the current_device API across all accelerators is expected to return a device id as an integer, according to the interface docs.
However, when a tensor is created with an integer device id, Torch defaults to the CUDA backend, which leads to a crash on NPUs (and potentially other accelerators as well).
To resolve this, we should use get_accelerator().current_device_name() instead, which returns the correct device identifier string, such as "npu:0", "cuda:0", or "xpu:0". This API provides the proper context for creating tensors on a specific hardware accelerator.
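For illustration, here is a minimal sketch of the pattern being changed (the tensor shape is a placeholder, not taken from the actual diff):

```python
import torch
from deepspeed.accelerator import get_accelerator

# Problematic pattern: current_device() returns a bare integer device id.
# Torch interprets an integer device as a CUDA device index, so this
# crashes on NPU (and potentially other non-CUDA accelerators).
# t = torch.zeros(16, device=get_accelerator().current_device())

# Fixed pattern: current_device_name() returns a full device string such as
# "npu:0", "cuda:0", or "xpu:0", so Torch dispatches to the correct backend.
t = torch.zeros(16, device=get_accelerator().current_device_name())
```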
I also noticed that device=get_accelerator().current_device() is used in several files under deepspeed/inference, which may also lead to crashes on other accelerators.
@microsoft-github-policy-service agree
Hi, @tjruwase. Would you please review this PR?
BTW, this issue happened before. See https://github.com/microsoft/DeepSpeed/pull/3933.
Hi, @tjruwase. I just fixed a formatting issue in a new commit. Would you please approve the workflows to run?
Hi, @tjruwase, the nv-accelerate-v100 and nv-lightning-v100 CI workflows seem to be down.
Could you please take a look? Thanks.
I did a search for current_device() in the DeepSpeed repo, and it looks like most occurrences of current_device() should be current_device_name() in order to be compatible with non-CUDA devices. Increasing test coverage on non-CUDA devices might help catch such usage in the future, as sketched below.
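As a hypothetical illustration of that suggestion (this test is an assumption, not part of this PR), a device-agnostic check could create a tensor through the accelerator abstraction and verify it lands on the active backend:

```python
import torch
from deepspeed.accelerator import get_accelerator


def test_tensor_creation_uses_device_name():
    # Hypothetical sketch: create a tensor via the accelerator abstraction
    # and confirm its device type matches the active accelerator
    # ("cuda", "npu", "xpu", ...) rather than silently defaulting to CUDA.
    acc = get_accelerator()
    t = torch.empty(4, device=acc.current_device_name())
    expected_type = acc.current_device_name().split(":")[0]
    assert t.device.type == expected_type
```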
Hi, @tjruwase @loadams, all checks have passed. Could you add this PR to the merge queue?
Hi, @tjruwase, this PR was removed from the merge queue by the bot, and I'm not sure why. Could you please check it out?
Hi @harygo2 - looks like a transient failure on the merge queue system, I'll re-queue.