
ERROR:torch.distributed.elastic.multiprocessing.api:failed?

A-Polyana opened this issue on Oct 30, 2022 · 12 comments

Hello, I've run into some problems, both while creating the class images and after training finishes.

When creating the class images:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `4` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 0 files to the new cache system
0it [00:00, ?it/s]
[2022-10-30 21:53:50,853] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 46951.16it/s]
Traceback (most recent call last):
  File "/home/polyana/github/diffusers/examples/dreambooth/train_dreambooth.py", line 662, in <module>
    main(args)
  File "/home/polyana/github/diffusers/examples/dreambooth/train_dreambooth.py", line 356, in main
    pipeline = StableDiffusionPipeline.from_pretrained(
  File "/home/polyana/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/pipeline_utils.py", line 577, in from_pretrained
    raise ValueError(
ValueError: Pipeline <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> expected {'unet', 'text_encoder', 'safety_checker', 'scheduler', 'vae', 'tokenizer', 'feature_extractor'}, but only {'unet', 'text_encoder', 'safety_checker', 'scheduler', 'vae', 'tokenizer'} were passed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29305) of binary: /home/polyana/anaconda3/envs/diffusers/bin/python

After training finishes:

[2022-10-30 21:55:31,658] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
Steps:  10%|██████▉                                                              | 1/10 [00:07<01:09,  7.68s/it, loss=0.232, lr=5e-6][2022-10-30 21:55:32,800] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
Steps:  20%|██████████████                                                        | 2/10 [00:08<00:30,  3.83s/it, loss=0.51, lr=5e-6][2022-10-30 21:55:33,940] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
Steps:  30%|████████████████████▋                                                | 3/10 [00:09<00:18,  2.60s/it, loss=0.437, lr=5e-6][2022-10-30 21:55:35,048] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
Steps:  40%|███████████████████████████▌                                         | 4/10 [00:11<00:12,  2.01s/it, loss=0.312, lr=5e-6][2022-10-30 21:55:36,162] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
Steps:  50%|██████████████████████████████████▌                                  | 5/10 [00:12<00:08,  1.69s/it, loss=0.592, lr=5e-6][2022-10-30 21:55:37,244] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
Steps:  60%|█████████████████████████████████████████▍                           | 6/10 [00:13<00:05,  1.48s/it, loss=0.451, lr=5e-6][2022-10-30 21:55:38,335] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
Steps:  70%|████████████████████████████████████████████████▎                    | 7/10 [00:14<00:04,  1.35s/it, loss=0.158, lr=5e-6][2022-10-30 21:55:39,482] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
Steps:  80%|████████████████████████████████████████████████████████              | 8/10 [00:15<00:02,  1.29s/it, loss=0.28, lr=5e-6][2022-10-30 21:55:40,565] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
Steps:  90%|██████████████████████████████████████████████████████████████       | 9/10 [00:16<00:01,  1.22s/it, loss=0.237, lr=5e-6][2022-10-30 21:55:41,609] [INFO] [stage_1_and_2.py:1763:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 7842.26it/s]
Traceback (most recent call last):                                                                            | 0/12 [00:00<?, ?it/s]
  File "/home/polyana/github/diffusers/examples/dreambooth/train_dreambooth.py", line 662, in <module>
    main(args)
  File "/home/polyana/github/diffusers/examples/dreambooth/train_dreambooth.py", line 646, in main
    pipeline = StableDiffusionPipeline.from_pretrained(
  File "/home/polyana/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/pipeline_utils.py", line 577, in from_pretrained
    raise ValueError(
ValueError: Pipeline <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> expected {'unet', 'safety_checker', 'text_encoder', 'vae', 'feature_extractor', 'tokenizer', 'scheduler'}, but only {'unet', 'text_encoder', 'vae', 'tokenizer', 'scheduler'} were passed.
Steps: 100%|████████████████████████████████████████████████████████████████████| 10/10 [00:20<00:00,  2.04s/it, loss=0.101, lr=5e-6]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29405) of binary: /home/polyana/anaconda3/envs/diffusers/bin/python

I use this shell script under WSL2 (Ubuntu 22.04) with DeepSpeed:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
export MODEL_NAME="****/****"
export INSTANCE_DIR="training"
export CLASS_DIR="classes"
export OUTPUT_DIR="model"
 
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="photo of a" \
  --class_prompt="photo of a" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=10 \
  --max_train_steps=10 \
  --mixed_precision=fp16

Is there any fix so that I don't see the `torch.distributed.elastic.multiprocessing.api:failed` error?

A-Polyana avatar Oct 30 '22 13:10 A-Polyana

I got the same problem when using `accelerate config` with the DeepSpeed option.

CrazyBoyM avatar Nov 01 '22 05:11 CrazyBoyM

I got the same problem when using `accelerate config` with the DeepSpeed option.

I think it's related to this error: ValueError: Pipeline <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> expected {'unet', 'text_encoder', 'safety_checker', 'scheduler', 'vae', 'tokenizer', 'feature_extractor'}, but only {'unet', 'text_encoder', 'safety_checker', 'scheduler', 'vae', 'tokenizer'} were passed.

Unfortunately, I can't solve it myself because I'm not an expert :(

A-Polyana avatar Nov 01 '22 05:11 A-Polyana

@patil-suraj could you take a look here?

patrickvonplaten avatar Nov 02 '22 17:11 patrickvonplaten

Hi, this issue has been fixed in recent diffusers versions, which allow `safety_checker=None`. Could you please update your diffusers version with `pip install -U diffusers`? That should fix it.
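
For reference, a minimal sketch of what this enables (the model id below is a placeholder rather than the one from this issue, and the exact version requirement is an assumption):

```python
# Minimal sketch, assuming a diffusers release where optional pipeline components
# such as the safety checker may be None. "your-model-id" is a placeholder for the
# value passed as --pretrained_model_name_or_path.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "your-model-id",
    safety_checker=None,        # skip the safety checker component
    torch_dtype=torch.float16,  # matches --mixed_precision=fp16 in the script above
)
```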

patil-suraj avatar Nov 09 '22 14:11 patil-suraj

Hi, this issue has been fixed in recent diffusers versions, which allow `safety_checker=None`. Could you please update your diffusers version with `pip install -U diffusers`? That should fix it.

I still get the error after running `pip install -U diffusers`.

The diffusers version is 0.8.1.

@patil-suraj can you help here?

universewill avatar Nov 25 '22 14:11 universewill

Did you solve your problem?

universewill avatar Nov 26 '22 10:11 universewill

Gently pinging @patil-suraj again here.

patrickvonplaten avatar Nov 30 '22 12:11 patrickvonplaten

I think maybe it's because there isn't enough RAM?

universewill avatar Nov 30 '22 12:11 universewill

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 24 '22 15:12 github-actions[bot]

Gently pinging @patil-suraj again, and also @williamberman and @pcuenca.

patrickvonplaten avatar Jan 05 '23 21:01 patrickvonplaten

I think this issue should have been resolved now on diffusers main. Let us know if it's still occurring.
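
A quick way to verify, after installing from main (for example with `pip install git+https://github.com/huggingface/diffusers`), is to load the pipeline with the optional components left out. A sketch, with a placeholder model id:

```python
# Sanity-check sketch: confirm the installed version and that loading a model which
# lacks the feature_extractor/safety_checker no longer raises the ValueError above.
# "your-model-id" is a placeholder; substitute your own model path or Hub id.
import diffusers
from diffusers import StableDiffusionPipeline

print(diffusers.__version__)

pipe = StableDiffusionPipeline.from_pretrained(
    "your-model-id",
    safety_checker=None,
    feature_extractor=None,
)
print("Pipeline loaded without the missing-component ValueError.")
```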

patil-suraj avatar Jan 25 '23 12:01 patil-suraj

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 19 '23 15:02 github-actions[bot]