Giyeong Oh


> hi @BootsofLagrangian I am using Windows. Could you please share your accelerate config? And an example of the run script for training? Thank you so much! Here are a...

Accelerate handles this. Inside the [accelerator.accumulate](https://github.com/huggingface/accelerate/blob/159c0dd02a42c30545821b7287376fe4be04d5ee/src/accelerate/accelerator.py#L1046) context manager, accelerate synchronizes gradients and loss via [sync_gradients](https://github.com/huggingface/accelerate/blob/159c0dd02a42c30545821b7287376fe4be04d5ee/src/accelerate/accelerator.py#L1020). sd-scripts utilizes accelerate from Hugging Face, which is very helpful for high-level distributed training.
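In code, the pattern looks roughly like this (a minimal toy sketch, not sd-scripts' actual training loop; the model, optimizer, and `gradient_accumulation_steps=4` are placeholders):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Toy model/data just for illustration.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # Inside accumulate(), cross-process gradient sync is skipped until the
    # last micro-step, when accelerator.sync_gradients becomes True; the
    # prepared optimizer's step()/zero_grad() are no-ops on the skipped steps.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```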

`accelerate launch --num_processes=[NUM_YOUR_GPUS_PER_MACHINE] --num_machines=[NUM_YOUR_INDEPENDENT_MACHINES] --multi_gpu --gpu_ids=[GPU_IDS] "train_network.py" args...`

If you have 4 GPUs and one machine, give the args as `accelerate launch --num_processes=4 --multi_gpu --num_machines=1 --gpu_ids=0,1,2,3 "train_network.py" args...`

> `accelerate launch --num_processes=[NUM_YOUR_GPUS_PER_MACHINE] --num_machines=[NUM_YOUR_INDEPENDENT_MACHINES] --multi_gpu --gpu_ids=[GPU_IDS] "train_network.py" args...`
>
> If you have 4 GPUs and one machine, give the args as `accelerate launch --num_processes=4 --multi_gpu --num_machines=1 --gpu_ids=0,1,2,3 "train_network.py"`...

@BotLifeGamer Here is an example command line for training a LoRA: `accelerate launch --num_processes=2 --multi_gpu --num_machines=1 --gpu_ids=0,1 "train_network.py" --pretrained_model_name_or_path=[huggingface_path or base model path to use] --network_module=networks.lora --save_model_as=safetensors --caption_extension=".txt" --seed="42" --training_comment=[some comment...`

@Charmandrigo Sorry about that, I only have experience with single-machine training. But I think accelerate supports multi-machine training. If you run `accelerate config`, you can find options for multi-machine...
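For example (untested on my side, and the IP/port are placeholders), a two-machine launch would look something like `accelerate launch --multi_gpu --num_machines=2 --num_processes=8 --machine_rank=0 --main_process_ip=[MAIN_NODE_IP] --main_process_port=29500 "train_network.py" args...`, where `--num_processes` is the total GPU count across all machines, and you run the same command on each machine with its own `--machine_rank` (0 on the main node, 1 on the second).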

> [@BootsofLagrangian](https://github.com/BootsofLagrangian) Your 4 GPUs are from the same brand? Do you know if it's possible to use AMD alongside NVIDIA?

Yes, 4x RTX3090. Heterogeneous device training is a really...

Do your H100s connect via NVLink, or just PCIe? If it's PCIe only, speed degradation occurs due to the PCIe communication bottleneck.
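You can check with `nvidia-smi topo -m`: `NV#` entries in the matrix mean an NVLink connection, while `PIX`/`PXB`/`PHB`/`SYS` mean the path goes over PCIe.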

> > Do your H100s connect via NVLink, or just PCIe? If it's PCIe only, speed degradation occurs due to the PCIe communication bottleneck.
>
> ok it turns out all are...

> @BootsofLagrangian it is not like I purchased them, I am using Massed Compute :)
>
> They said they have SXM4 A100. I will test the script there....