IDM-VTON Error while training

I was trying to train IDM-VTON on the VITON-HD dataset and ran into the error below. I followed the instructions in README.md to set up the IP-Adapter.

➜ sh ./train_xl.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `1`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/ubuntu/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be r
  warnings.warn(
/home/ubuntu/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be r
  warnings.warn(
The config attributes {'decay': 0.9999, 'inv_gamma': 1.0, 'min_decay': 0.0, 'optimization_step': 37000, 'power': 0.6666666666666666, 'update_after_s


Some weights of UNet2DConditionModel were not initialized from the model checkpoint at diffusers/stable-diffusion-xl-1.0-inpainting-0.1 and are newl
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/home/ubuntu/GITHUB/yisol/IDM-VTON/train_xl.py", line 797, in <module>
    main()
  File "/home/ubuntu/GITHUB/yisol/IDM-VTON/train_xl.py", line 354, in main
    image_proj_model.load_state_dict(state_dict["image_proj"], strict=True)
  File "/home/ubuntu/conda/envs/idm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Resampler:
        size mismatch for latents: copying a param with shape torch.Size([1, 16, 1280]) from checkpoint, the shape in current model is torch.Size([1
        size mismatch for proj_in.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.S
        size mismatch for proj_in.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size([166
        size mismatch for proj_out.weight: copying a param with shape torch.Size([2048, 1280]) from checkpoint, the shape in current model is torch.
        size mismatch for layers.0.0.norm1.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torc
        size mismatch for layers.0.0.norm1.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.
        size mismatch for layers.0.0.norm2.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torc
        size mismatch for layers.0.0.norm2.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.
        size mismatch for layers.0.0.to_q.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is
        size mismatch for layers.0.0.to_kv.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model i
        size mismatch for layers.0.0.to_out.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model
        size mismatch for layers.0.1.0.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Si
        size mismatch for layers.0.1.0.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size
        size mismatch for layers.0.1.1.weight: copying a param with shape torch.Size([5120, 1280]) from checkpoint, the shape in current model is to
        size mismatch for layers.0.1.3.weight: copying a param with shape torch.Size([1280, 5120]) from checkpoint, the shape in current model is to
        size mismatch for layers.1.0.norm1.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torc
        size mismatch for layers.1.0.norm1.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.
        size mismatch for layers.1.0.norm2.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torc
        size mismatch for layers.1.0.norm2.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.
        size mismatch for layers.1.0.to_q.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is
        size mismatch for layers.1.0.to_kv.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model i
        size mismatch for layers.1.0.to_out.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model
        size mismatch for layers.1.1.0.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Si
        size mismatch for layers.1.1.0.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size
        size mismatch for layers.1.1.1.weight: copying a param with shape torch.Size([5120, 1280]) from checkpoint, the shape in current model is to
        size mismatch for layers.1.1.3.weight: copying a param with shape torch.Size([1280, 5120]) from checkpoint, the shape in current model is to
        size mismatch for layers.2.0.norm1.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torc
        size mismatch for layers.2.0.norm1.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.
        size mismatch for layers.2.0.norm2.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torc
        size mismatch for layers.2.0.norm2.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.
        size mismatch for layers.2.0.to_q.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is
        size mismatch for layers.2.0.to_kv.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model i
        size mismatch for layers.2.0.to_out.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model
        size mismatch for layers.2.1.0.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Si
        size mismatch for layers.2.1.0.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size
        size mismatch for layers.2.1.1.weight: copying a param with shape torch.Size([5120, 1280]) from checkpoint, the shape in current model is to
        size mismatch for layers.2.1.3.weight: copying a param with shape torch.Size([1280, 5120]) from checkpoint, the shape in current model is to
        size mismatch for layers.3.0.norm1.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torc
        size mismatch for layers.3.0.norm1.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.
        size mismatch for layers.3.0.norm2.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torc
        size mismatch for layers.3.0.norm2.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size([1664]).
        size mismatch for layers.3.0.to_q.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 1664]).
        size mismatch for layers.3.0.to_kv.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 1664]).
        size mismatch for layers.3.0.to_out.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1664, 1280]).
        size mismatch for layers.3.1.0.weight: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size([1664]).
        size mismatch for layers.3.1.0.bias: copying a param with shape torch.Size([1280]) from checkpoint, the shape in current model is torch.Size([1664]).
        size mismatch for layers.3.1.1.weight: copying a param with shape torch.Size([5120, 1280]) from checkpoint, the shape in current model is torch.Size([6656, 1664]).
        size mismatch for layers.3.1.3.weight: copying a param with shape torch.Size([1280, 5120]) from checkpoint, the shape in current model is torch.Size([1664, 6656]).
Traceback (most recent call last):
  File "/home/ubuntu/conda/envs/idm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/conda/envs/idm/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/ubuntu/conda/envs/idm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/home/ubuntu/conda/envs/idm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/conda/envs/idm/bin/python', 'train_xl.py', '--gradient_checkpointing', '--use_8bit_adam', '--output_dir=result', '--train_batch_size=6', '--data_dir=/home/ubuntu/DATASETS/VITON_HD']' returned non-zero exit status 1.

Sep 12 '24 05:09 aravindhv10

The train_xl.sh file was modified to point to the correct dataset:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch train_xl.py --gradient_checkpointing --use_8bit_adam --output_dir=result --train_batch_size=6 --data_dir=/home/ubuntu/DATASETS/VITON_HD

Sep 12 '24 05:09 aravindhv10

I linked the IP-Adapter weights and the image encoder from the original HF repo into the ckpt folder:

➜ ls -lh ./ckpt/ip_adapter/ ./ckpt/image_encoder/
./ckpt/ip_adapter/:
Permissions Size User   Date Modified Name
lrwxrwxrwx    59 ubuntu 11 Sep 19:43  ip-adapter-plus_sdxl_vit-h.bin -> ../../IP-Adapter/sdxl_models/ip-adapter-plus_sdxl_vit-h.bin

./ckpt/image_encoder/:
Permissions Size User   Date Modified Name
lrwxrwxrwx    54 ubuntu 11 Sep 19:42  config.json -> ../../IP-Adapter/sdxl_models/image_encoder/config.json
lrwxrwxrwx    60 ubuntu 11 Sep 19:42  model.safetensors -> ../../IP-Adapter/sdxl_models/image_encoder/model.safetensors
lrwxrwxrwx    60 ubuntu 11 Sep 13:31  pytorch_model.bin -> ../../IP-Adapter/sdxl_models/image_encoder/pytorch_model.bin

The original repo contains:

➜ ls ./IP-Adapter/sdxl_models -lh
Permissions Size User   Date Modified Name
drwxrwxr-x     - ubuntu 11 Sep 13:19  image_encoder
.rw-rw-r--  1.0G ubuntu 11 Sep 13:17  ip-adapter-plus-face_sdxl_vit-h.bin
.rw-rw-r--  848M ubuntu 11 Sep 13:17  ip-adapter-plus-face_sdxl_vit-h.safetensors
.rw-rw-r--  1.0G ubuntu 11 Sep 13:17  ip-adapter-plus_sdxl_vit-h.bin
.rw-rw-r--  848M ubuntu 11 Sep 13:17  ip-adapter-plus_sdxl_vit-h.safetensors
.rw-rw-r--  703M ubuntu 11 Sep 13:17  ip-adapter_sdxl.bin
.rw-rw-r--  703M ubuntu 11 Sep 13:17  ip-adapter_sdxl.safetensors
.rw-rw-r--  698M ubuntu 11 Sep 13:18  ip-adapter_sdxl_vit-h.bin
.rw-rw-r--  698M ubuntu 11 Sep 13:18  ip-adapter_sdxl_vit-h.safetensors

Sep 12 '24 05:09 aravindhv10

lrwxrwxrwx    60 ubuntu 11 Sep 19:42  model.safetensors -> ../../IP-Adapter/sdxl_models/image_encoder/model.safetensors
lrwxrwxrwx    60 ubuntu 11 Sep 13:31  pytorch_model.bin -> ../../IP-Adapter/sdxl_models/image_encoder/pytorch_model.bin

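This is where the problem is. The repo instructions say to use the weights and config from IP-Adapter/models/image_encoder, but you have linked IP-Adapter/sdxl_models/image_encoder/ instead. That would explain the 1280 vs. 1664 size mismatches: the checkpoint expects an image encoder with 1280-dimensional features, while the model was built around an encoder with 1664.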

Replacing that should fix your problem.
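
A quick way to confirm this before retraining is to compare the image encoder's hidden size with the input width the adapter checkpoint expects. The sketch below is only an illustration: the helper names are made up here, it assumes config.json exposes a top-level hidden_size field (as CLIP vision configs normally do), and the shapes are copied from the error message above rather than read from the .bin file.

```python
import json
from pathlib import Path


def read_hidden_size(config_path):
    """Return the `hidden_size` field from an image encoder config.json.

    Assumption: the config stores `hidden_size` at the top level.
    """
    with Path(config_path).open() as f:
        return json.load(f)["hidden_size"]


def encoder_matches_adapter(encoder_hidden_size, proj_in_weight_shape):
    """Check that the adapter's proj_in layer accepts the encoder's features.

    A linear layer's weight has shape (out_features, in_features), so
    in_features should equal the hidden size of the image encoder.
    """
    return proj_in_weight_shape[1] == encoder_hidden_size


# Shape reported for the checkpoint's proj_in.weight in the error message.
CHECKPOINT_PROJ_IN_SHAPE = (1280, 1280)

# 1664 is the size the current model was built with, per the same message.
print(encoder_matches_adapter(1664, CHECKPOINT_PROJ_IN_SHAPE))  # False
print(encoder_matches_adapter(1280, CHECKPOINT_PROJ_IN_SHAPE))  # True
```

With the links from this thread, read_hidden_size("ckpt/image_encoder/config.json") would give the value to pass as the first argument.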

Sep 12 '24 12:09 atagulmert

Traceback (most recent call last):
  File "/mnt/c/Users/admin/work_cqai/IDM-VTON/train_xl.py", line 10, in <module>
    from transformers import CLIPImageProcessor
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1373, in __getattr__
    value = getattr(module, name)
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1372, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1382, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/transformers/models/clip/image_processing_clip.py", line 21, in <module>
    from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 28, in <module>
    from .image_transforms import center_crop, normalize, rescale
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/transformers/image_transforms.py", line 47, in <module>
    import tensorflow as tf
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/__init__.py", line 49, in <module>
    from tensorflow._api.v2 import __internal__
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/_api/v2/__internal__/__init__.py", line 13, in <module>
    from tensorflow._api.v2.__internal__ import feature_column
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/_api/v2/__internal__/feature_column/__init__.py", line 8, in <module>
    from tensorflow.python.feature_column.feature_column_v2 import DenseColumn # line: 1777
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/python/feature_column/feature_column_v2.py", line 38, in <module>
    from tensorflow.python.feature_column import feature_column as fc_old
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/python/feature_column/feature_column.py", line 41, in <module>
    from tensorflow.python.layers import base
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/python/layers/base.py", line 16, in <module>
    from tensorflow.python.keras.legacy_tf_layers import base
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/python/keras/__init__.py", line 25, in <module>
    from tensorflow.python.keras import models
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/python/keras/models.py", line 25, in <module>
    from tensorflow.python.keras.engine import training_v1
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/tensorflow/python/keras/engine/training_v1.py", line 71, in <module>
    from scipy.sparse import issparse # pylint: disable=g-import-not-at-top
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/scipy/sparse/__init__.py", line 274, in <module>
    from ._csr import *
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/scipy/sparse/_csr.py", line 11, in <module>
    from ._sparsetools import (csr_tocsc, csr_tobsr, csr_count_blocks,
AttributeError: _ARRAY_API not found
Traceback (most recent call last):
  File "/mnt/c/Users/admin/work_cqai/IDM-VTON/train_xl.py", line 13, in <module>
    from diffusers import AutoencoderKL, DDPMScheduler
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/diffusers/__init__.py", line 5, in <module>
    from .utils import (
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/diffusers/utils/__init__.py", line 37, in <module>
    from .dynamic_modules_utils import get_class_from_dynamic_module
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/diffusers/utils/dynamic_modules_utils.py", line 28, in <module>
  File "/home/qammar/miniconda3/envs/idm2/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/home/qammar/miniconda3/envs/idm2/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/qammar/miniconda3/envs/idm2/bin/python', 'train_xl.py', '--gradient_checkpointing', '--use_8bit_adam', '--output_dir=result', '--train_batch_size=6', '--data_dir=zalando-hd-resized']' returned non-zero exit status 1.

Can anyone help me with this error?

Nov 21 '24 06:11 Qammarbhat
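
A possible starting point for the `_ARRAY_API not found` error: it is commonly reported when an extension module built against NumPy 1.x is imported under NumPy 2.x, though the traceback alone does not confirm that this is the cause here. The sketch below only reports the installed versions so they can be compared; the helper names are illustrative and not part of IDM-VTON.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def major_version(version):
    """Return the leading integer of a version string, or None if absent."""
    if not version:
        return None
    return int(version.split(".")[0])


if __name__ == "__main__":
    # Packages that appear in the traceback above.
    found = installed_versions(["numpy", "scipy", "tensorflow", "transformers"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")

    if major_version(found["numpy"]) == 2:
        print("NumPy 2.x is installed; packages built for NumPy 1.x may fail to import.")
```

Running this inside the same conda environment as train_xl.py shows which versions the import chain actually sees.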