Zach Mueller

472 comments by Zach Mueller

@faaany lmk if this is good to merge

If you mean for training, no, that's not supported.

We support that via DeepSpeed/FSDP weight offloading. We're looking into native pipeline parallelism soon.
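
For reference, a minimal sketch of enabling weight/optimizer offload through Accelerate's `DeepSpeedPlugin`; the ZeRO stage and offload devices here are illustrative choices, not from the original comment:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Illustrative settings: ZeRO stage 3 with parameters and optimizer state
# offloaded to CPU, trading GPU memory for host<->device traffic.
plugin = DeepSpeedPlugin(
    zero_stage=3,
    offload_param_device="cpu",
    offload_optimizer_device="cpu",
)
accelerator = Accelerator(deepspeed_plugin=plugin)
```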

Is that the full error trace? It seems like some of it may be cut off.

It is the total number of GPUs; we then divide it by `num_machines` to get the per-node count. (That SLURM example may be wrong.)

I'm stating that the launcher will reduce it. `--num_processes` is the *total* number of GPUs and assumes each node has the same number of GPUs. So rather than `--n-proc-per-node=2`...
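
A minimal sketch of that reduction, assuming a hypothetical 2-node, 2-GPUs-per-node job (the numbers and the launch command in the comment are illustrative):

```python
# Hypothetical 2-node x 2-GPU job, launched with something like:
#   accelerate launch --num_machines 2 --num_processes 4 train.py
# `--num_processes` is the TOTAL GPU count across all nodes; the launcher
# derives the per-node count, assuming every node has the same number of GPUs.
num_processes = 4   # total GPUs across all nodes
num_machines = 2    # number of nodes
nproc_per_node = num_processes // num_machines
assert nproc_per_node == 2  # roughly what gets passed to each node
```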

`num_processes` and `process_index` get their information from `torch.distributed.get_world_size()` and `torch.distributed.get_rank()`. `if accelerator.is_main_process` should only run on the main node and its first process. `is_local_main_process` would be run 4 times, one...
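
A minimal sketch of the difference, assuming the 4-node setup the comment implies, so the local check fires once per node:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# True on exactly one process in the whole job: global rank 0,
# i.e. the first process on the main node.
if accelerator.is_main_process:
    print("global main process")

# True on the first process of *each* node (local rank 0), so with
# 4 nodes this block runs 4 times, once per node.
if accelerator.is_local_main_process:
    print("local main process on this node")
```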

Might be good to have this as an alternative choice. From their docs: "MS-AMP has the following benefit comparing with Transformer Engine: Speed up memory-limited operations by accessing one byte..."
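
For context, a sketch of what selecting MS-AMP as the FP8 backend could look like, assuming an Accelerate version where `FP8RecipeKwargs` accepts a `backend` argument; the `opt_level` shown is illustrative:

```python
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Assumption: an Accelerate release where FP8RecipeKwargs supports
# backend="msamp" alongside the Transformer Engine backend.
kwargs = [FP8RecipeKwargs(backend="msamp", opt_level="O2")]
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=kwargs)
```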