ERROR:torch.distributed.elastic.multiprocessing.api:failed?
Hello, I've run into some problems, both before the class images are created and after training finishes.
When generating the class images:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `4` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 0 files to the new cache system
0it [00:00, ?it/s]
[2022-10-30 21:53:50,853] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 46951.16it/s]
Traceback (most recent call last):
File "/home/polyana/github/diffusers/examples/dreambooth/train_dreambooth.py", line 662, in <module>
main(args)
File "/home/polyana/github/diffusers/examples/dreambooth/train_dreambooth.py", line 356, in main
pipeline = StableDiffusionPipeline.from_pretrained(
File "/home/polyana/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/pipeline_utils.py", line 577, in from_pretrained
raise ValueError(
ValueError: Pipeline <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> expected {'unet', 'text_encoder', 'safety_checker', 'scheduler', 'vae', 'tokenizer', 'feature_extractor'}, but only {'unet', 'text_encoder', 'safety_checker', 'scheduler', 'vae', 'tokenizer'} were passed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29305) of binary: /home/polyana/anaconda3/envs/diffusers/bin/python
After training finishes:
[2022-10-30 21:55:31,658] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
Steps: 10%|██████▉ | 1/10 [00:07<01:09, 7.68s/it, loss=0.232, lr=5e-6][2022-10-30 21:55:32,800] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
Steps: 20%|██████████████ | 2/10 [00:08<00:30, 3.83s/it, loss=0.51, lr=5e-6][2022-10-30 21:55:33,940] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
Steps: 30%|████████████████████▋ | 3/10 [00:09<00:18, 2.60s/it, loss=0.437, lr=5e-6][2022-10-30 21:55:35,048] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
Steps: 40%|███████████████████████████▌ | 4/10 [00:11<00:12, 2.01s/it, loss=0.312, lr=5e-6][2022-10-30 21:55:36,162] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
Steps: 50%|██████████████████████████████████▌ | 5/10 [00:12<00:08, 1.69s/it, loss=0.592, lr=5e-6][2022-10-30 21:55:37,244] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
Steps: 60%|█████████████████████████████████████████▍ | 6/10 [00:13<00:05, 1.48s/it, loss=0.451, lr=5e-6][2022-10-30 21:55:38,335] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
Steps: 70%|████████████████████████████████████████████████▎ | 7/10 [00:14<00:04, 1.35s/it, loss=0.158, lr=5e-6][2022-10-30 21:55:39,482] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
Steps: 80%|████████████████████████████████████████████████████████ | 8/10 [00:15<00:02, 1.29s/it, loss=0.28, lr=5e-6][2022-10-30 21:55:40,565] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
Steps: 90%|██████████████████████████████████████████████████████████████ | 9/10 [00:16<00:01, 1.22s/it, loss=0.237, lr=5e-6][2022-10-30 21:55:41,609] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 7842.26it/s]
Traceback (most recent call last): | 0/12 [00:00<?, ?it/s]
File "/home/polyana/github/diffusers/examples/dreambooth/train_dreambooth.py", line 662, in <module>
main(args)
File "/home/polyana/github/diffusers/examples/dreambooth/train_dreambooth.py", line 646, in main
pipeline = StableDiffusionPipeline.from_pretrained(
File "/home/polyana/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/pipeline_utils.py", line 577, in from_pretrained
raise ValueError(
ValueError: Pipeline <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> expected {'unet', 'safety_checker', 'text_encoder', 'vae', 'feature_extractor', 'tokenizer', 'scheduler'}, but only {'unet', 'text_encoder', 'vae', 'tokenizer', 'scheduler'} were passed.
Steps: 100%|████████████████████████████████████████████████████████████████████| 10/10 [00:20<00:00, 2.04s/it, loss=0.101, lr=5e-6]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29405) of binary: /home/polyana/anaconda3/envs/diffusers/bin/python
I use this .sh file with WSL2 (Ubuntu 22.04) and DeepSpeed:
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
export MODEL_NAME="****/****"
export INSTANCE_DIR="training"
export CLASS_DIR="classes"
export OUTPUT_DIR="model"
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="photo of a" \
--class_prompt="photo of a \
--resolution=512 \
--train_batch_size=1 \
--sample_batch_size=1 \
--gradient_accumulation_steps=1 --gradient_checkpointing \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=10 \
--max_train_steps=10 \
--mixed_precision=fp16
If I don't want to see the `torch.distributed.elastic.multiprocessing.api:failed` error, is there any fix?
I got the same problem when using `accelerate config` with the DeepSpeed option.
I think it's related to this error:
ValueError: Pipeline <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> expected {'unet', 'text_encoder', 'safety_checker', 'scheduler', 'vae', 'tokenizer', 'feature_extractor'}, but only {'unet', 'text_encoder', 'safety_checker', 'scheduler', 'vae', 'tokenizer'} were passed.
I'm sad that I can't solve it because I'm not an expert :(
@patil-suraj could you take a look here?
Hi, this issue has been fixed in recent diffusers versions, which allow `safety_checker=None`. Could you please update your diffusers version with `pip install -U diffusers`? That should fix it.
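For reference, a minimal sketch of what the updated behaviour allows (the model id below is just a placeholder for illustration, not taken from this issue):

```python
# Sketch only: with a recent diffusers release, the Stable Diffusion pipeline
# can be loaded with safety_checker=None instead of requiring every component.
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder model id for illustration
    safety_checker=None,
)
```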
I still get the error after running `pip install -U diffusers`.
My diffusers version is 0.8.1.
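One thing worth double-checking (just a sketch, in case more than one environment exists under WSL2): confirm that the interpreter which actually runs the training sees the same version that pip reports.

```python
# Sketch: print the diffusers version and install location seen by this interpreter.
import diffusers

print(diffusers.__version__)
print(diffusers.__file__)  # shows which site-packages the module is loaded from
```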
@patil-suraj can you help here?
Has your problem been solved?
Gently ping again @patil-suraj here
I think maybe it's because there isn't enough RAM?
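If you want to verify that (just a sketch, assuming the CUDA device is visible from WSL2), you can print the free GPU memory before launching training:

```python
# Sketch: report free vs. total GPU memory to rule a memory bottleneck in or out.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"GPU memory: {free_bytes / 1e9:.2f} GB free of {total_bytes / 1e9:.2f} GB")
else:
    print("No CUDA device visible")
```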
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Gently ping again @patil-suraj and also @williamberman and @pcuenca
I think this issue should have been resolved now on diffusers main. Let us know if it's still happening.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.