First, I trained the coarse VAE with the given command:
python train.py ./configs/shapenet/chair/train_vae_16x16x16_dense.yaml --wname 16x16x16-kld-0.03_dim-16 --max_epochs 100 --cut_ratio 16 --gpus 1 --batch_size 16
Because my GPU setup differs from the paper's (a single A800 rather than the 8 × V100 used in the paper), I changed the batch size to 16 and set gradient accumulation to 2.
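For context, here are the effective batch sizes I was aiming for, under my assumption that effective batch = per-GPU batch size × gradient-accumulation steps × GPU count:

```python
# My assumption: effective batch = per-GPU batch * accumulation steps * GPU count.
def effective_batch(per_gpu_batch: int, accum_steps: int, num_gpus: int) -> int:
    return per_gpu_batch * accum_steps * num_gpus

print(effective_batch(16, 2, 1))   # 32  -> my VAE command above
print(effective_batch(8, 32, 1))   # 256 -> my diffusion command below
```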
After the coarse VAE trained successfully, I tried to train the coarse diffusion model with the given command (again, only the batch size and gradient accumulation were changed):
python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 --batch_size 8 --accumulate_grad_batches 32
But an error occurred:
2024-07-19 15:47:45.053 | INFO | __main__:<module>:171 - This is train_auto.py! Please note that you should use 300 instead of 300.0 for resuming.
git root error: Cmd('git') failed due to: exit code(128)
cmdline: git rev-parse --show-toplevel
stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:
git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'
wandb: Currently logged in as: 13532152291 (13532152291-sun-yat-sen-university). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240719_154747-rk4p0a77
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run chair_diffusion_dense/16x16x16_kld-0.03
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:258: LightningDeprecationWarning: pytorch_lightning.utilities.distributed.rank_zero_only has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from pytorch_lightning.utilities instead.
rank_zero_deprecation(
2024-07-19 15:48:01.165 | INFO | xcube.modules.autoencoding.sunet:__init__:240 - latent dim: 16
Traceback (most recent call last):
File "/mnt/pfs/users/dengken/code/XCube/train.py", line 380, in
net_model = net_module(model_args)
File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 84, in init
self.vae = self.load_first_stage_from_pretrained().eval()
File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 264, in load_first_stage_from_pretrained
return net_module.load_from_checkpoint(args_ckpt, hparams=model_args)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 139, in load_from_checkpoint
return _load_from_checkpoint(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 188, in _load_from_checkpoint
return _load_state(cls, checkpoint, strict=strict, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 247, in _load_state
keys = obj.load_state_dict(checkpoint["state_dict"], strict=strict)
File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv1.Conv.weight: copying a param with shape torch.Size([64, 512, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 512, 3, 3, 3]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.Conv.weight: copying a param with shape torch.Size([64, 64, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3, 3]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.Conv.weight: copying a param with shape torch.Size([512, 32, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 16, 3, 3, 3]).
wandb: 🚀 View run chair_diffusion_dense/16x16x16_kld-0.03 at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
wandb: ⭐️ View project at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../wandb/wandb/run-20240719_154747-rk4p0a77/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
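Reading the mismatched shapes: the pre-KL conv in my checkpoint has 64 output channels where the diffusion run expects 32, and the post-KL conv takes 32 input channels where it expects 16. If the pre-KL conv emits mean and log-variance (i.e. 2 × latent dim), my checkpoint would correspond to a latent dim of 32, while the diffusion run builds a VAE with latent dim 16 (matching the "latent dim: 16" log line above), though I may be misreading the architecture. A minimal sketch for double-checking the checkpoint shapes (the path is a placeholder for my trained VAE checkpoint):

```python
import torch

# Placeholder path: my coarse-VAE checkpoint written by train.py.
ckpt = torch.load("path/to/coarse_vae.ckpt", map_location="cpu")
state = ckpt["state_dict"]

# The two conv weights named in the size-mismatch error.
for key in (
    "unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv1.Conv.weight",
    "unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.Conv.weight",
):
    print(key, tuple(state[key].shape))
```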
Also, there is no error when I use the pretrained VAE checkpoint you provide for download.
Could you please help me? Thanks!