PyTorch nightlies break Accelerate on non-CUDA devices.

Open Vargol opened this issue 1 year ago • 2 comments

System Info

The current PyTorch nightly (2.3.0.dev20240314) breaks Accelerate: torch.cuda functions are no longer no-ops on non-CUDA devices.

So code like


    # clean pre and post foward hook
    if is_npu_available():
        torch.npu.empty_cache()
    elif is_xpu_available():
        torch.xpu.empty_cache()
    else:
        torch.cuda.empty_cache()

from accelerate/utils/modeling.py (around line 440) now fails with

File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 443, in set_module_tensor_to_device
    torch.cuda.empty_cache()
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/torch/cuda/memory.py", line 161, in empty_cache
    _lazy_init()
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

when used on an MPS device.
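
A possible defensive pattern, sketched here under assumptions rather than taken from Accelerate's code: check that a backend is actually available before emptying its cache, so that a CPU- or MPS-only build never reaches `torch.cuda.empty_cache()`. The helper name `empty_accelerator_cache` and its `torch_mod` parameter are hypothetical. The `torch` module is passed in as an argument so the sketch can be exercised with a stand-in object on a machine without PyTorch installed.

```python
from types import SimpleNamespace


def empty_accelerator_cache(torch_mod):
    """Empty the cache of the first available backend and return its name.

    Hypothetical helper, not part of Accelerate. `torch_mod` is expected to
    be the imported `torch` module (or an object shaped like it).
    """
    cuda = getattr(torch_mod, "cuda", None)
    if cuda is not None and cuda.is_available():
        cuda.empty_cache()
        return "cuda"

    # torch.backends.mps.is_available() reports whether MPS can be used;
    # torch.mps.empty_cache() only exists in newer PyTorch releases.
    mps_backend = getattr(getattr(torch_mod, "backends", None), "mps", None)
    mps = getattr(torch_mod, "mps", None)
    if mps_backend is not None and mps_backend.is_available() and hasattr(mps, "empty_cache"):
        mps.empty_cache()
        return "mps"

    # No accelerator cache to clear (plain CPU build, for example).
    return "none"


# Stand-in for an MPS-only build, so the sketch runs without torch installed.
cleared = []
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False,
                         empty_cache=lambda: cleared.append("cuda")),
    backends=SimpleNamespace(mps=SimpleNamespace(is_available=lambda: True)),
    mps=SimpleNamespace(empty_cache=lambda: cleared.append("mps")),
)
print(empty_accelerator_cache(fake_torch))  # prints "mps"
```

With a real install, the call would be `empty_accelerator_cache(torch)` after `import torch`. The NPU and XPU branches from the snippet above are left out to keep the sketch short.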

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

Load a model through diffusers on an MPS device. The script I'm using can be cut down to:

from diffusers import AutoencoderKL
import torch

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix",
                                    torch_dtype=torch.float16,
                                    force_upcast=False).to('mps')

It fails with the following error:

/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/Diffusers/break.py", line 4, in <module>
    vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix",
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 669, in from_pretrained
    unexpected_keys = load_model_dict_into_meta(
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 159, in load_model_dict_into_meta
    set_module_tensor_to_device(model, param_name, device, value=param, dtype=dtype)
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 443, in set_module_tensor_to_device
    torch.cuda.empty_cache()
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/torch/cuda/memory.py", line 161, in empty_cache
    _lazy_init()
  File "/Volumes/SSD2TB/AI/InvokeAI/lib/python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Expected behavior

The model loads and the script exits without error.

Vargol avatar Mar 14 '24 14:03 Vargol

Does this work on a lower torch version?

If that particular branch is being run, then the assumption is that an accelerator is in use (in this case, it's trying to use your MPS device).

muellerzr avatar Mar 14 '24 14:03 muellerzr

It worked on torch 2.3.0.dev20240221, on the current release, and on all the older releases I've used.

I'm trying to work my way through the torch code. I think they may have already backed out the change, or at least partially backed it out; it's hard to tell.

I think this is the backout: https://github.com/pytorch/pytorch/commit/a2a4693c1babace14de13c344993b0070b74bd9c

Vargol avatar Mar 14 '24 14:03 Vargol

@Vargol can you verify if it's still broken now that it's been a bit?

muellerzr avatar Mar 25 '24 15:03 muellerzr

Hi

It is now working in version 2.4.0.dev20240321, in the 2.3.0 and 2.2.2 builds I pulled from https://download.pytorch.org/whl/test/cpu/, and in the current 2.2.1 release.

So it seems to have been fixed or was backed out.
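
For anyone comparing their own install against the nightlies mentioned in this thread, here is a minimal sketch that pulls the build date out of a version string and looks it up. The names `nightly_date`, `reported_status`, and `REPORTED` are hypothetical. The table holds only the three nightlies reported above, so any other date comes back as unreported rather than being guessed.

```python
import re
from datetime import date

# Nightly version strings carry a ".devYYYYMMDD" segment, e.g. "2.3.0.dev20240314".
_DEV_RE = re.compile(r"\.dev(\d{4})(\d{2})(\d{2})")

# Only the data points reported in this thread; nothing is interpolated.
REPORTED = {
    date(2024, 2, 21): "worked",
    date(2024, 3, 14): "broken",
    date(2024, 3, 21): "worked",
}


def nightly_date(version):
    """Return the build date of a nightly version string, or None for a release."""
    match = _DEV_RE.search(version)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)


def reported_status(version):
    """Look up what this thread reported for the given torch version string."""
    built = nightly_date(version)
    if built is None:
        return "release build, not a nightly"
    return REPORTED.get(built, "not reported in this thread")


print(reported_status("2.3.0.dev20240314"))  # prints "broken"
```

The installed version string is available as `torch.__version__`. A local suffix such as `+cpu` does not affect the lookup, because the pattern only reads the `.dev` segment.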

Vargol avatar Mar 25 '24 15:03 Vargol

Thanks for checking @Vargol !

muellerzr avatar Mar 25 '24 15:03 muellerzr