
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device()

Open · harygo2 opened this pull request 1 year ago • 7 comments

Creating a Torch tensor with device=get_accelerator().current_device() can crash when running on an NPU.

This happens because the current_device API is expected, per the interface docs, to return the device id as a plain integer on every accelerator.

However, when an integer is passed as the device argument at tensor creation, Torch defaults to the CUDA backend, which leads to a crash on NPUs (and potentially on other accelerators as well).

To resolve this, we should use get_accelerator().current_device_name() instead, which returns a fully qualified device string such as "npu:0", "cuda:0", or "xpu:0". Because the string carries the backend prefix, it provides the context Torch needs to create tensors on the correct hardware accelerator.

I also noticed that device=get_accelerator().current_device() is used in several files under deepspeed/inference, which may likewise lead to crashes on other accelerators.
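To make the failure mode concrete, here is a minimal sketch of the contract described above. The class below is a hypothetical mock, not DeepSpeed's actual accelerator implementation (the real ones live under deepspeed/accelerator/); it only illustrates why the integer id is ambiguous while the device-name string is not.

```python
# Hypothetical mock of the accelerator interface discussed above;
# this is an illustration of the contract, not DeepSpeed code.
class MockNpuAccelerator:
    def __init__(self, index: int = 0):
        self._index = index

    def current_device(self) -> int:
        # Per the interface docs, this returns the device id as a bare
        # integer. Passing that int to torch.tensor(..., device=...)
        # makes Torch interpret it as "cuda:<int>", crashing on an NPU.
        return self._index

    def current_device_name(self) -> str:
        # Returns a fully qualified device string; the backend prefix
        # ("npu", "cuda", "xpu", ...) removes the ambiguity.
        return f"npu:{self._index}"


acc = MockNpuAccelerator()
print(acc.current_device())        # integer id: backend is ambiguous
print(acc.current_device_name())   # "npu:0": backend is explicit
```

The fix in this PR replaces the first form with the second at tensor-creation sites, so the backend is always spelled out.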

harygo2 avatar Apr 25 '24 12:04 harygo2

@microsoft-github-policy-service agree

harygo2 avatar Apr 25 '24 12:04 harygo2

Hi, @tjruwase . Would you please review this PR?

BTW, this issue happened before. See https://github.com/microsoft/DeepSpeed/pull/3933.

minchao-sun avatar Apr 26 '24 09:04 minchao-sun

[like] Olatunji Ruwase reacted to your message.

tjruwase avatar Apr 26 '24 11:04 tjruwase

Hi, @tjruwase. Just fixed a formatting issue in a new commit. Would you please approve the workflows to run?

harygo2 avatar Apr 28 '24 01:04 harygo2

Hi, @tjruwase, the nv-accelerate-v100 and nv-lightning-v100 CI workflows seem to be down.

Could you please take a look? Thx.

harygo2 avatar Apr 29 '24 03:04 harygo2

I did a search for current_device() in the DeepSpeed repo, and it looks like most occurrences of current_device() should be current_device_name() in order to be compatible with non-CUDA devices. Increasing test coverage on non-CUDA devices might help catch such usage in the future.

delock avatar Apr 30 '24 07:04 delock
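The audit delock describes could be sketched roughly as follows. This is an illustrative scan, not the exact search used: the file contents and path are made up for the demo, and in the real tree you would point it at deepspeed/. Note that matching the literal substring "current_device()" (with parentheses) excludes the "current_device_name()" form, since the latter never contains that exact text.

```python
import pathlib
import re
import tempfile

# Illustrative source file mimicking the two call patterns; in practice
# you would walk deepspeed/**/*.py instead of a temp directory.
demo_dir = pathlib.Path(tempfile.mkdtemp())
(demo_dir / "example.py").write_text(
    "x = torch.empty(4, device=get_accelerator().current_device())\n"
    "y = torch.empty(4, device=get_accelerator().current_device_name())\n"
)

# Flag only the integer-id form, which is the suspect pattern.
pattern = re.compile(r"current_device\(\)")
hits = []
for path in demo_dir.rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if pattern.search(line):
            hits.append((path.name, lineno, line.strip()))

for name, lineno, line in hits:
    print(f"{name}:{lineno}: {line}")
```

Each reported hit is a candidate for the current_device_name() replacement made in this PR.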

Hi, @tjruwase @loadams, All checks have passed, could you add this PR to the merge queue?

harygo2 avatar May 03 '24 13:05 harygo2

Hi, @tjruwase, this PR was removed from merge queue by the bot, not sure why it was removed, could you please check it out?

harygo2 avatar May 06 '24 02:05 harygo2

Hi @harygo2 - looks like a transient failure on the merge queue system, I'll re-queue.

loadams avatar May 06 '24 16:05 loadams