Zach Mueller
@faaany lmk if this is good to merge
If you mean for training, no, that's not supported.
We support that via DeepSpeed/FSDP weight offloading. We're looking into native pipeline parallelism soon.
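For the FSDP route, a minimal sketch of what weight offloading could look like (assuming the `cpu_offload` field on Accelerate's FSDP plugin; check your version's docs for the exact options):

```python
# Sketch: offload FSDP-sharded weights to CPU via Accelerate.
# Assumes FullyShardedDataParallelPlugin exposes a cpu_offload field that
# takes torch's CPUOffload config -- verify against your Accelerate version.
from torch.distributed.fsdp import CPUOffload
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    cpu_offload=CPUOffload(offload_params=True),  # park parameters on CPU between uses
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```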
No, we do not.
Is that the full error trace? It seems like some of it may be cut off
It is the total number of GPUs; we then reduce it by `num_machines`. (That SLURM example may possibly be wrong.)
I'm stating that the launcher will reduce it. `--num_processes` is the *total* number of GPUs and assumes each node has the same number of GPUs. So rather than `--nproc_per_node=2`...
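To make the arithmetic concrete, a purely illustrative sketch (the variable names are mine, not launcher internals):

```python
# Illustrative only: how the launcher derives the per-node worker count from
# the flags above. These variables just mirror the CLI flags.
num_processes = 4  # --num_processes: total GPUs across all nodes
num_machines = 2   # --num_machines: number of nodes

# Each node is assumed to have the same GPU count, so the launcher starts
# num_processes // num_machines workers per node -- the value torchrun would
# take as --nproc_per_node.
procs_per_node = num_processes // num_machines
print(procs_per_node)  # 2
```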
`num_processes` and `process_index` get their information from `torch.distributed.get_world_size()` and `torch.distributed.get_rank()`. `if accelerator.is_main_process` should only run on the main node and its first process. `is_local_main_process` would run 4 times, one...
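As a rough sketch of the difference (assuming, say, a 4-node setup so `is_local_main_process` fires once per node):

```python
# Sketch: global vs. local main-process checks in Accelerate.
# With 4 nodes, the first block runs exactly once (global rank 0), while the
# second block runs 4 times -- once on the first process of each node.
from accelerate import Accelerator

accelerator = Accelerator()

if accelerator.is_main_process:
    # Only the first process on the main node reaches here.
    print(f"global main: rank {accelerator.process_index} of {accelerator.num_processes}")

if accelerator.is_local_main_process:
    # The first process on every node reaches here.
    print(f"local main: local rank {accelerator.local_process_index}")
```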
Please provide us with your code and the full stack trace/error log.
Might be good to have this as an alternative choice. From their docs: MS-AMP has the following benefits compared with Transformer Engine: speed up memory-limited operations by accessing one byte...
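If it were integrated, usage might end up looking something like the existing FP8 kwargs handler with a backend switch; purely a hypothetical sketch, and the `backend`/`opt_level` names here are assumptions rather than a confirmed API:

```python
# Hypothetical: choosing MS-AMP instead of Transformer Engine as the FP8
# backend. The backend/opt_level arguments are assumptions for illustration,
# not a confirmed Accelerate signature.
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

fp8_handler = FP8RecipeKwargs(backend="msamp", opt_level="O2")  # or backend="te"
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_handler])
```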