
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device()

Open · harygo2 opened this pull request 1 year ago • 7 comments

Creating a Torch tensor with device=get_accelerator().current_device() can crash when running on an NPU.

This happens because the current_device API is expected, per the interface docs, to return the device id as a plain integer on every accelerator.

However, when an integer is passed as the device argument at tensor creation, Torch defaults to the CUDA backend, which leads to a crash on NPUs (and potentially on other accelerators as well).

To resolve this, we should use get_accelerator().current_device_name() instead, which returns a fully qualified device string such as "npu:0", "cuda:0", or "xpu:0". Because the string carries the backend prefix, it provides the context Torch needs to create tensors on the correct hardware accelerator.

I also noticed that device=get_accelerator().current_device() is used in several files under deepspeed/inference, which may likewise lead to crashes on other accelerators.
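To make the failure mode concrete, here is a minimal sketch of the contract described above. The class below is a hypothetical mock, not DeepSpeed's actual accelerator implementation (the real ones live under deepspeed/accelerator/); it only illustrates why the integer id is ambiguous while the device-name string is not.

```python
# Hypothetical mock of the accelerator interface discussed above;
# this is an illustration of the contract, not DeepSpeed code.
class MockNpuAccelerator:
    def __init__(self, index: int = 0):
        self._index = index

    def current_device(self) -> int:
        # Per the interface docs, this returns the device id as a bare
        # integer. Passing that int to torch.tensor(..., device=...)
        # makes Torch interpret it as "cuda:<int>", crashing on an NPU.
        return self._index

    def current_device_name(self) -> str:
        # Returns a fully qualified device string; the backend prefix
        # ("npu", "cuda", "xpu", ...) removes the ambiguity.
        return f"npu:{self._index}"


acc = MockNpuAccelerator()
print(acc.current_device())        # integer id: backend is ambiguous
print(acc.current_device_name())   # "npu:0": backend is explicit
```

The fix in this PR replaces the first form with the second at tensor-creation sites, so the backend is always spelled out.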

harygo2 avatar Apr 25 '24 12:04 harygo2

@microsoft-github-policy-service agree

harygo2 avatar Apr 25 '24 12:04 harygo2

Hi, @tjruwase . Would you please review this PR?

BTW, this issue happened before. See https://github.com/microsoft/DeepSpeed/pull/3933.

minchao-sun avatar Apr 26 '24 09:04 minchao-sun

[like] Olatunji Ruwase reacted to your message.

tjruwase avatar Apr 26 '24 11:04 tjruwase

Hi, @tjruwase. Just fixed a formatting issue in a new commit. Would you please approve the workflows to run?

harygo2 avatar Apr 28 '24 01:04 harygo2

Hi, @tjruwase, the nv-accelerate-v100 and nv-lightning-v100 CI workflows seem to be down.

Could you please take a look? Thx.

harygo2 avatar Apr 29 '24 03:04 harygo2

I did a search for current_device() in the DeepSpeed repo, and it looks like most occurrences of current_device() should be current_device_name() in order to be compatible with non-CUDA devices. Increasing test coverage on non-CUDA devices might help catch such usage in the future.

delock avatar Apr 30 '24 07:04 delock
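The audit delock describes could be sketched roughly as follows. This is an illustrative scan, not the exact search used: the file contents and path are made up for the demo, and in the real tree you would point it at deepspeed/. Note that matching the literal substring "current_device()" (with parentheses) excludes the "current_device_name()" form, since the latter never contains that exact text.

```python
import pathlib
import re
import tempfile

# Illustrative source file mimicking the two call patterns; in practice
# you would walk deepspeed/**/*.py instead of a temp directory.
demo_dir = pathlib.Path(tempfile.mkdtemp())
(demo_dir / "example.py").write_text(
    "x = torch.empty(4, device=get_accelerator().current_device())\n"
    "y = torch.empty(4, device=get_accelerator().current_device_name())\n"
)

# Flag only the integer-id form, which is the suspect pattern.
pattern = re.compile(r"current_device\(\)")
hits = []
for path in demo_dir.rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if pattern.search(line):
            hits.append((path.name, lineno, line.strip()))

for name, lineno, line in hits:
    print(f"{name}:{lineno}: {line}")
```

Each reported hit is a candidate for the current_device_name() replacement made in this PR.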

Hi, @tjruwase @loadams, All checks have passed, could you add this PR to the merge queue?

harygo2 avatar May 03 '24 13:05 harygo2

Hi, @tjruwase, this PR was removed from merge queue by the bot, not sure why it was removed, could you please check it out?

harygo2 avatar May 06 '24 02:05 harygo2

Hi @harygo2 - looks like a transient failure on the merge queue system, I'll re-queue.

loadams avatar May 06 '24 16:05 loadams