zero123
batch size issue when training on a custom dataset
I'm trying to train the model on a custom dataset with 4 A6000 GPUs (49 GB each), but training uses about 27 GB on each GPU even with a batch size of 1. Here are my config file and GPU status:
`model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: "image_target"
    cond_stage_key: "image_cond"
    image_size: 32
    channels: 4
    cond_stage_trainable: false # Note: different from the one we trained before
    conditioning_key: hybrid
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 100 ]
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32 # unused
        in_channels: 8
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_heads: 8
        use_spatial_transformer: True
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: True
        legacy: False
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPImageEmbedder
data:
  target: ldm.data.simple.ObjaverseDataModuleFromConfig
  params:
    root_dir: my_path
    batch_size: 1
    num_workers: 8
    total_view: 4
    train:
      validation: False
      image_transforms:
        size: 256
    validation:
      validation: True
      image_transforms:
        size: 256
lightning:
  find_unused_parameters: false
  metrics_over_trainsteps_checkpoint: True
  modelcheckpoint:
    params:
      every_n_train_steps: 5000
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 500
        max_images: 32
        increase_log_steps: False
        log_first_step: True
        log_images_kwargs:
          use_ema_scope: False
          inpaint: False
          plot_progressive_rows: False
          plot_diffusion_rows: False
          N: 32
          unconditional_guidance_scale: 3.0
          unconditional_guidance_label: [""]
  trainer:
    benchmark: True
    val_check_interval: 5000000 # really sorry
    num_sanity_val_steps: 0
    accumulate_grad_batches: 5
Wed Apr 24 06:47:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1D:00.0 Off |                  Off |
| 48%   71C    P2             203W / 300W |  27238MiB / 49140MiB |     92%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:1E:00.0 Off |                  Off |
| 46%   70C    P2             204W / 300W |  27242MiB / 49140MiB |     93%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 49%   73C    P2             202W / 300W |  27242MiB / 49140MiB |     94%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               Off | 00000000:20:00.0 Off |                  Off |
| 47%   70C    P2             194W / 300W |  27222MiB / 49140MiB |     94%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+`
Is it normal for batch size 1 to consume this much GPU memory?
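
For context, here is a rough back-of-the-envelope sketch of the memory that does not depend on batch size (weights, gradients, optimizer state). Everything in it is an assumption rather than something read from this config or measured on this run: the parameter count of roughly 860M is the figure commonly quoted for a Stable Diffusion v1-style UNet, the optimizer is assumed to be fp32 AdamW with two moment buffers, and an fp32 EMA copy of the weights is assumed to be kept. The function name is hypothetical.

```python
def fixed_training_memory_gib(n_params, bytes_per_value=4, adam_states=2, ema_copy=True):
    """Estimate batch-size-independent training memory in GiB.

    Counts one copy each for weights and gradients, `adam_states` optimizer
    buffers, and optionally one EMA copy, all stored at `bytes_per_value`.
    Activations, frozen encoders, and the CUDA context are NOT included.
    """
    copies = 1 + 1 + adam_states  # weights + gradients + optimizer moments
    if ema_copy:
        copies += 1  # assumed EMA shadow weights
    return n_params * bytes_per_value * copies / 2**30


# Assumption: approximate UNet size, not taken from the config above.
ASSUMED_UNET_PARAMS = 860_000_000

estimate = fixed_training_memory_gib(ASSUMED_UNET_PARAMS)
print(f"fixed cost under these assumptions: {estimate:.1f} GiB")
```

If these assumptions are close, a large share of the 27 GB per GPU would be a fixed cost that is paid regardless of `batch_size`, with the frozen autoencoder, the CLIP image encoder, activations, and the CUDA context on top. I have not verified this against the actual run.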