Zach Mueller comments

Results: 435 comments of Zach Mueller


[feature request] accelerate launcher: add numa affinities control

Tbh though, the `pynvml` solution makes more sense: we can add it as a CLI option and just raise an error if it's not installed. Let me work on that...
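As a rough illustration of the "raise an error if it's not installed" idea, here is a minimal sketch of an optional-dependency check. The helper names `is_package_available` and `require_package` are hypothetical and are not taken from the accelerate codebase; only the standard-library `importlib.util.find_spec` call is assumed.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str, feature: str) -> None:
    """Raise a readable error when an optional dependency is missing.

    Hypothetical helper: a launcher could call this when the user enables
    a feature (such as NUMA affinity control) that needs `name`.
    """
    if not is_package_available(name):
        raise ImportError(
            f"{feature} requires the optional package `{name}`, "
            f"which is not installed. Try `pip install {name}`."
        )
```

A launcher could call something like `require_package("pynvml", "NUMA affinity control")` only when the corresponding option is passed, so users who never enable the feature do not need the extra dependency.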

[feature request] accelerate launcher: add numa affinities control

It is not; given that, it looks like we'll need to do it the hard way without `pynvml` (and just run a series of bash commands).

[feature request] accelerate launcher: add numa affinities control

No worries, while un-fun, I'm getting it working with some subprocess ;)
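To make the subprocess route concrete, here is a minimal sketch, not the code that was actually pushed. It assumes `nvidia-smi topo -m` is the command being queried and that CPU affinities are reported as range strings such as `0-3,8-11`. The function names `query_gpu_topology` and `parse_cpu_list` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def query_gpu_topology() -> Optional[str]:
    """Return the raw text of `nvidia-smi topo -m`, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA tooling on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi", "topo", "-m"],
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a range string such as '0-3,8-11' into a list of CPU ids."""
    cpus: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus
```

On Linux, a list produced this way could then be passed to `os.sched_setaffinity(0, cpus)` to pin the current process; that final step is left out here because it depends on the platform and on which cores actually exist.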

[feature request] accelerate launcher: add numa affinities control

@stas00 if you want to try some bleeding edge stuff, just pushed some commits. Haven't fully tested it on a multi-gpu system yet, but at least the dry run of...

[feature request] accelerate launcher: add numa affinities control

Let's start small with the NVIDIA version, then we can add the AMD and Gaudi2 versions as follow-ups (since we can only test the nvidia-smi version right now).

[feature request] accelerate launcher: add numa affinities control

@stas00 please see https://github.com/huggingface/accelerate/pull/2535 :)

Distributed Training Gets Stuck?

@pjspol can you try setting `dispatch_batches=False` in the accelerator? (I can check with lucidrains too, in case his APIs are a bit behind.) This is a known bug...

Distributed Training Gets Stuck?

@thevasudevgupta the recommended solution of `dispatch_batches=False` is still a requirement due to changes with the torch dataloader that have led to these issues and require a significant rewrite for us to...

Distributed Training Gets Stuck?

The same answer as above, don’t use batch dispatching.
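For readers who want to try the suggestion, here is a minimal sketch of disabling batch dispatching. It assumes `accelerate` is installed. Which spelling applies depends on the installed release: the comments above pass `dispatch_batches=False` to the accelerator directly, while newer releases are assumed here to take it through `DataLoaderConfiguration`. The wrapper name `build_accelerator` is hypothetical.

```python
def build_accelerator():
    """Create an Accelerator with batch dispatching turned off.

    Imports are done lazily so this module loads even where
    `accelerate` is not installed.
    """
    from accelerate import Accelerator

    try:
        # Assumed path for newer releases: dataloader options live in a config object.
        from accelerate.utils import DataLoaderConfiguration
    except ImportError:
        # Older releases: pass the flag straight to the Accelerator.
        return Accelerator(dispatch_batches=False)

    config = DataLoaderConfiguration(dispatch_batches=False)
    return Accelerator(dataloader_config=config)
```

With dispatching off, each process iterates its own dataloader shard instead of receiving batches from the main process, so dataloaders should still be passed through `accelerator.prepare(...)` as usual.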

lr_scheduler not updated when auto_find_batch_size set to True and batch_size decays

@raghavanone @thomas-schillaci could you try building from main and seeing if that fixes the issue? I think https://github.com/huggingface/transformers/pull/24521 fixed this