[BUG]: How can I run examples/images/diffusion with use_ema?
🐛 Describe the bug
I can successfully run the example with the default settings, following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion.
When I change the value of use_ema from False to True, the following error occurs.
What could be the reason for this problem? Thanks.
Log info:
Project config
model:
  base_learning_rate: 0.0001
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: caption
    image_size: 64
    channels: 4
    cond_stage_trainable: false
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: true
    scheduler_config:
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps:
        - 1
        cycle_lengths:
        - 10000000000000
        f_start:
        - 1.0e-06
        f_max:
        - 0.0001
        f_min:
        - 1.0e-10
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32
        from_pretrained: /home//data/stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_heads: 8
        use_spatial_transformer: true
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: false
        legacy: false
        use_fp16: true
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        from_pretrained: /home//data/stable-diffusion-v1-4/vae/diffusion_pytorch_model.bin
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      params:
        use_fp16: true
    use_fp16: true
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 16
    wrap: false
    train:
      target: ldm.data.base.Txt2ImgIterableBaseDataset
      params:
        file_path: /home/notebook/data/group/huangxin/laion-400m/e-commerce/e-commerce-0.tsv
        world_size: 1
        rank: 0
Lightning config
trainer:
  accelerator: gpu
  devices: 1
  log_gpu_memory: all
  max_epochs: 2
  precision: 16
  auto_select_gpus: false
  strategy:
    target: pytorch_lightning.strategies.ColossalAIStrategy
    params:
      use_chunk: false
      enable_distributed_storage: True,
      placement_policy: cuda
      force_outputs_fp32: false
  log_every_n_steps: 2
  logger: true
  default_root_dir: /tmp/diff_log/
  profiler: pytorch
logger_config:
  wandb:
    target: pytorch_lightning.loggers.WandbLogger
    params:
      name: nowname
      save_dir: /tmp/diff_log/
      offline: opt.debug
      id: nowname
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
Epoch 0: 0%| | 0/133740 [00:00<?, ?it/s] samples11 in dataset 2139828
samples11 in dataset 2139828
/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
samples11 in dataset 2139828
samples11 in dataset 2139828
[11/16/22 11:35:24] INFO colossalai - colossalai - INFO: /opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py:137 step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 0: 0%| | 1/133740 [00:49<1833:05:15, 49.34s/it, loss=0.175, v_num=0, train/loss_simple_step=0.175, train/loss_vlb_step=0.0018, train/loss_step=0.175, global_step=0.000, lr_abs=1.6e-9]Summoning checkpoint.
[11/16/22 11:35:27] INFO colossalai - ProcessGroup - INFO: /opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24 get
INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
FIT Profiler Report
Profile stats for: records
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
cudaDeviceSynchronize 100.00% 71.000us 100.00% 71.000us 71.000us 1
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 71.000us
Traceback (most recent call last):
  File "/home//code/ColossalAI/examples/images/diffusion/main.py", line 817, in <module>
    trainer.fit(model, data)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
    self.fit_loop.run()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 231, in advance
    self.trainer._call_lightning_module_hook("on_train_batch_end", batch_end_outputs, batch, batch_idx)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home//code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 436, in on_train_batch_end
    self.model_ema(self.model)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home//code/ColossalAI/examples/images/diffusion/ldm/modules/ema.py", line 42, in forward
    shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
AttributeError: 'NoneType' object has no attribute 'sub_'
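For context, the failing line performs an in-place EMA update of a shadow copy of each model weight. Below is a minimal sketch of that kind of update, not the actual code in ldm/modules/ema.py; the class name EmaSketch and everything else here are illustrative assumptions. It shows why a shadow entry that is None (for example, one whose storage was released or never materialized by the ZeRO/Gemini strategy) produces exactly this AttributeError.

```python
import torch
import torch.nn as nn


class EmaSketch(nn.Module):
    """Sketch of an EMA tracker: keeps one shadow tensor per trainable parameter."""

    def __init__(self, model: nn.Module, decay: float = 0.9999):
        super().__init__()
        self.decay = decay
        # Shadow copies made at construction time. If any of these ended up as
        # None, the in-place update below would fail with
        # AttributeError: 'NoneType' object has no attribute 'sub_'.
        self.shadow = {
            name: p.detach().clone()
            for name, p in model.named_parameters()
            if p.requires_grad
        }

    @torch.no_grad()
    def forward(self, model: nn.Module):
        one_minus_decay = 1.0 - self.decay
        for name, param in model.named_parameters():
            if name in self.shadow:
                # Same pattern as the failing line in the traceback.
                self.shadow[name].sub_(one_minus_decay * (self.shadow[name] - param))


# Usage: update the shadow weights after each optimizer step,
# which is what on_train_batch_end does via self.model_ema(self.model).
net = nn.Linear(4, 4)
ema = EmaSketch(net)
ema(net)
```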
Environment
Thanks for your issue; we will fix the bug as soon as we can.
If cond_stage_trainable = True, it also reports an error:
│ /opt/conda/lib/python3.7/site-packages/colossalai/gemini/chunk/manager.py:159 in get_chunk
│
│ 156 Args:
│ 157 tensor (torch.Tensor): a torch tensor object
│ 158 """
│ ❱ 159 return self.tensor_chunk_map[tensor]
│ 160
│ 161 def get_cuda_movable_chunks(self) -> List[Chunk]:
│ 162 """
╰───────────────────────────────────────────────────────
KeyError: ColoParameter: ColoTensor:
Parameter containing:
Parameter(ColoParameter([[ 4.2009e-04, -3.7899e-03, 3.8624e-03, ..., -8.2350e-04,
1.2369e-03, 5.8413e-04],
[ 3.8624e-04, -1.3628e-03, 2.3880e-03, ..., -7.9250e-04,
2.1076e-03, 1.0943e-04],
[ 1.2493e-03, 9.7466e-04, 1.9093e-03, ..., 1.4000e-03,
1.1845e-03, -9.9087e-04],
...,
[-1.3588e-02, -1.8244e-03, 8.0872e-03, ..., 5.8174e-03,
-1.0162e-02, -3.7980e-04],
[-1.0368e-02, 6.7711e-03, 1.0557e-03, ..., 1.1563e-05,
-9.3384e-03, -1.8854e-03],
[-1.7729e-03, -1.2070e-02, -1.2665e-02, ..., 9.3079e-03,
6.6338e-03, -6.0425e-03]], device='cuda:1',
dtype=torch.float16))
DistSpec:
placement: DistPlacementPattern.REPLICATE
ProcessGroup:
Rank: 0, World size: 1, DP degree: 1, TP degree: 1
Ranks in group: [0]
None
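My reading of this (an assumption, not confirmed by the maintainers): the chunk manager keeps a map from each registered tensor to its chunk, and the cond-stage (CLIP) parameters made trainable here are looked up without ever having been registered, so the plain dictionary access on line 159 raises KeyError. A toy illustration of that failure mode; none of these names are ColossalAI's real API.

```python
import torch

# Toy stand-in for ChunkManager.tensor_chunk_map, filled when the model is wrapped.
tensor_chunk_map = {}

registered = torch.nn.Parameter(torch.zeros(2))
tensor_chunk_map[registered] = "chunk-0"  # parameter known to the manager

# A parameter the manager never saw, e.g. a cond-stage weight that only became
# trainable after wrapping.
unregistered = torch.nn.Parameter(torch.zeros(2))

try:
    _ = tensor_chunk_map[unregistered]  # mirrors `self.tensor_chunk_map[tensor]`
except KeyError:
    print("KeyError: this parameter was never registered with the chunk manager")
```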
@Fazziekey Hi, have you fixed this problem?
Thanks for your issue. We don't support cond-stage training yet; we will support it in the future.
Is it supported now?
Not yet.