lora-scripts

Linux: single-GPU training works, multi-GPU training errors out

Open magicwang1111 opened this issue 1 year ago • 3 comments

```
Traceback (most recent call last):
  File "/mnt/data/wangxi/lora-scripts/./sd-scripts/train_db.py", line 501, in <module>
    train(args)
  File "/mnt/data/wangxi/lora-scripts/./sd-scripts/train_db.py", line 321, in train
    encoder_hidden_states = train_util.get_hidden_states(
  File "/mnt/data/wangxi/lora-scripts/sd-scripts/library/train_util.py", line 4003, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
steps:   0%|          | 0/22080 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/data/wangxi/lora-scripts/./sd-scripts/train_db.py", line 501, in <module>
    train(args)
  File "/mnt/data/wangxi/lora-scripts/./sd-scripts/train_db.py", line 321, in train
    encoder_hidden_states = train_util.get_hidden_states(
  File "/mnt/data/wangxi/lora-scripts/sd-scripts/library/train_util.py", line 4003, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
Traceback (most recent call last):
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 996, in <module>
    main()
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 992, in main
    launch_command(args)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/wangxi/miniconda3/envs/lora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 35054 got signal: 2
12:51:03-224816 ERROR    Training failed / 训练失败
(lora) [wangxi@v100-4 lora-scripts]$ ^C
(lora) [wangxi@v100-4 lora-scripts]$
```
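The AttributeError above is a typical multi-GPU symptom: accelerate wraps the text encoder in DistributedDataParallel, so its original attributes (such as text_model) are only reachable through `.module`. Below is a minimal sketch of that unwrapping pattern; it is not the actual sd-scripts patch, and the helper name and call site are illustrative only.

```python
# Minimal sketch (not the actual sd-scripts fix): unwrap a DDP-wrapped module
# before reaching into attributes that only exist on the underlying model.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def unwrap_model(model: torch.nn.Module) -> torch.nn.Module:
    """Return the underlying module if the model is wrapped in DDP."""
    return model.module if isinstance(model, DDP) else model


def apply_final_layer_norm(text_encoder, encoder_hidden_states):
    # Hypothetical call site mirroring train_util.get_hidden_states:
    # text_encoder may or may not be DDP-wrapped depending on single- vs multi-GPU runs.
    te = unwrap_model(text_encoder)
    return te.text_model.final_layer_norm(encoder_hidden_states)
```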

magicwang1111 · Jan 05 '24

Could you share how you managed to install it successfully on Linux (Ubuntu)?

askyang · Jan 24 '24

I ran into this problem as well.

ik4sumi · Jan 25 '24

Single-GPU training works fine, but multi-GPU training hangs at "loading VAE from checkpoint" / "VAE: <All keys matched successfully>" and never advances. GPU utilization stays pinned at 99%, there is no error, and no progress bar ever appears. Has anyone run into this kind of issue? (A distributed smoke-test sketch follows the config below.)

[screenshot attached]

Here are my training parameters:

```toml
model_train_type = "sdxl-finetune"
pretrained_model_name_or_path = "/gemini/pretrain/juggernautXL_version6Rundiffusion.safetensors"
v2 = false
train_data_dir = "/gemini/code/data-3"
resolution = "1024,1024"
enable_bucket = true
min_bucket_reso = 256
max_bucket_reso = 1024
bucket_reso_steps = 64
output_name = "xysSDXL_jugg_dir1375and32106_rep10_e200_bs1_test0001"
output_dir = "./output"
save_model_as = "safetensors"
save_precision = "bf16"
save_every_n_epochs = 40
max_train_epochs = 800
train_batch_size = 4
gradient_checkpointing = true
learning_rate = 0.00001
learning_rate_te1 = 0.0000025
learning_rate_te2 = 0.0000025
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 0
lr_scheduler_num_cycles = 0
optimizer_type = "AdamW8bit"
log_with = "tensorboard"
logging_dir = "./logs"
caption_extension = ".txt"
shuffle_caption = true
weighted_captions = false
keep_tokens = 1
max_token_length = 255
seed = 1337
no_token_padding = false
mixed_precision = "bf16"
full_bf16 = true
xformers = true
lowram = false
cache_latents = true
cache_latents_to_disk = true
persistent_data_loader_workers = true
gpu_ids = [ "0", "2" ]
train_text_encoder = true
```
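A hang right after VAE loading with both GPUs pinned at 99% and no error usually points at NCCL communication rather than the training config itself. The sketch below is my own assumption, not part of lora-scripts: a standalone smoke test that checks whether the GPUs selected in `gpu_ids` can complete a simple all_reduce. If this also hangs, the problem is the NCCL/P2P setup (commonly diagnosed with `NCCL_DEBUG=INFO` and sometimes worked around with `NCCL_P2P_DISABLE=1`); if it passes, the issue is more likely in the training script.

```python
# Minimal NCCL smoke test (illustrative, not part of lora-scripts).
# Run on the same GPUs used for training, e.g.:
#   CUDA_VISIBLE_DEVICES=0,2 torchrun --nproc_per_node=2 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun provides RANK/WORLD_SIZE/LOCAL_RANK and the rendezvous address.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after all_reduce every rank
    # should hold the world size in every element. If this call never
    # returns, the GPUs cannot talk to each other over NCCL.
    x = torch.ones(4, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all_reduce result = {x.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```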

@Akegarasu

shanshouchen · Mar 02 '24