Zach Mueller
Nope, you do not. That is also extremely valid (and why the non-yaml option exists, for situations where we need to wrap/call it separately and a yaml makes it complicated)
Hi all, we finally narrowed down the two sources of leakage in the implementation that we could improve. #2089 will fix this, reducing your memory by a _significant amount_. For...
@maxidl can you share your modified code? Curious what those exceptions are that exist for "no good reason"
Thanks @maxidl, here's the approach the team has decided on: 1. I'll put a PR in today that lets you *explicitly disable* the blocking behavior, and...
What kind of gpu setup are you using?
@DragonDRLI can you try specifying "gpu_ids" as "all" in your config? Open `~/.cache/huggingface/accelerate/default_config.yaml` (e.g. with `vim`) and set: ``` gpu_ids: all ``` (Note: no quotes)
@DragonDRLI can you try perhaps upgrading your torch version? (Doubtful, but having some issues recreating this). E.g.: `pip install light-the-torch; ltt install torch torchvision -U`
As Sylvain says, it's your dataset that's the issue. I would recommend ensuring that there are enough samples for at least 1 full batch between all your GPUs (so if...
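A quick sketch of the arithmetic above (the helper names here are hypothetical, not part of Accelerate): one full step across all processes consumes `num_gpus * per_device_batch_size` samples, so the dataset must have at least that many.

```python
# Hypothetical sanity-check helpers (not part of the Accelerate API):
# verify a dataset has enough samples for at least one full batch
# across all GPUs.

def min_samples_needed(num_gpus: int, per_device_batch_size: int) -> int:
    # Each GPU draws per_device_batch_size samples per step, so one
    # full step across all processes needs this many samples total.
    return num_gpus * per_device_batch_size

def has_enough_samples(dataset_len: int, num_gpus: int,
                       per_device_batch_size: int) -> bool:
    return dataset_len >= min_samples_needed(num_gpus, per_device_batch_size)
```

For example, 4 GPUs with a per-device batch size of 8 need at least 32 samples; a 31-sample dataset would leave one GPU short on the first step.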
@efsotr during my tests I'm able to have it all work properly, however you'll need to specify a new port in your config to launch on, which may stem your...
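For reference, a minimal sketch of what "specify a new port" looks like: the port Accelerate uses for the main process can be set via the `main_process_port` key in the config yaml (the specific port number below is just an example).

```yaml
# ~/.cache/huggingface/accelerate/default_config.yaml (excerpt)
# Pick any free port; 29501 here is only an example value.
main_process_port: 29501
```

The same can be done per-launch with `accelerate launch --main_process_port 29501 ...` without editing the config file.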
Big model inference is only for _inference_, not training at this time. Oops: I'm wrong!