Zach Mueller

Results 443 comments of Zach Mueller

@SunMarc updating the CUDA drivers will solve this :) (tested on the 4090s). No need for fancy env settings etc., just do `python myscript.py`:
```python
import torch
from accelerate.utils import send_to_device
...
```

@vladmandic can you just do a try/catch around what gradio is running? I need an example of your gradio script and how it's being utilized for more ideas.
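A minimal sketch of that try/catch pattern, assuming a hypothetical `run_training` callable standing in for whatever the gradio handler invokes (the names are illustrative, not from the original thread):

```python
def run_training():
    # Hypothetical work that may raise; stands in for the real
    # training/inference call the gradio app is running.
    raise RuntimeError("CUDA out of memory")

def gradio_handler():
    """The function a gradio Button/Interface would call; it catches
    failures so the UI can display the error instead of crashing."""
    try:
        result = run_training()
        return f"Success: {result}"
    except Exception as exc:
        # Surface the failure to the UI as a string.
        return f"Training failed: {exc!r}"

print(gradio_handler())
```

The key point is that the handler itself owns the `try`/`except`, so any exception raised inside it becomes a value the UI can render.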

@vladmandic your attempt there is right: you need to call it purely pythonically and avoid subprocess when possible; that is the only way.
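To illustrate why the purely pythonic route matters, here is a sketch with an illustrative `train` function (not accelerate's actual API): an exception raised inside a subprocess never reaches the parent's `except` block, while a direct call propagates normally and can be caught.

```python
import subprocess
import sys

def train():
    # Stand-in for the real entry point; raises to simulate a failure.
    raise RuntimeError("boom")

# Subprocess route: the child process fails, but the parent only sees
# a non-zero return code, never a catchable Python exception.
proc = subprocess.run(
    [sys.executable, "-c", "raise RuntimeError('boom')"],
    capture_output=True, text=True,
)
print("subprocess return code:", proc.returncode)

# Direct-call route: the exception propagates and can be caught.
try:
    train()
except RuntimeError as exc:
    print(f"caught: {exc}")
```

This is why a try/catch in the caller only works once everything runs in-process.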

When trying to catch something like this, yes. We cannot guarantee that accelerate is free of subprocess, as certain configurations require us to launch it that way. I don't believe we...

Hi @vladmandic, I found another issue requesting the same thing. I'll look into this relatively soon (I promise, not months out like before), and we'll put in explicit exceptions for...

@polyrand try installing via `pip install fastinference[onnxcpu]` instead.

@mrwyattii this is because we're integrating with `Accelerate` to handle all the distributed code in `Trainer`. You can see the code we use to set everything up on the `Accelerate`...

Doing so will make `tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_fp16` fail with the same error as stated. Please try running `CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" ACCELERATE_USE_DEEPSPEED="yes" pytest -sv tests/deepspeed/test_deepspeed.py -k test_basic_distributed` to replicate.

Can you try installing from main? Aka: `pip install git+https://github.com/huggingface/accelerate`? And ideally, can you tell us the output of `accelerate env`?

@codetomakechange out of curiosity, have you found a way to do this *without* accelerate launch (aka native torch-xla)? If not, that's okay; I'll look into this soon.