torchtune

Gemma not saving checkpoints

Open • wiiiktor opened this issue 7 months ago • 7 comments

I am using Gemma-2B and it is not saving checkpoints at all. The run hangs: there is no error, it just waits forever. I use 4 GPUs, and even though memory usage is very low (about 5 GB of the 24 GB available on each GPU), the checkpoints are not saved. I had a similar problem saving the Llama 3 8B checkpoint, but that only required lowering the batch size. With Gemma, the batch size is already low and I have plenty of memory, yet the checkpoint is still not saved. max_seq_len is 32, so very small. I am on the latest torchtune version, and I had the same issue with the previous one.

I use the CLI command: tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config ./custom_gemma_2B_lora.yaml
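
A hang with no error gives very little to go on, so one way to collect more information is to launch the same command with verbose distributed logging and a time limit. The sketch below is only an illustration, not part of torchtune: `debug_env` and `run_with_timeout` are hypothetical helper names, and the only real knobs it relies on are the `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` environment variables.

```python
import os
import subprocess

# The exact command from the post, split into argv form.
TUNE_CMD = [
    "tune", "run", "--nnodes", "1", "--nproc_per_node", "4",
    "lora_finetune_distributed", "--config", "./custom_gemma_2B_lora.yaml",
]


def debug_env(base=None):
    """Return a copy of the environment with verbose distributed logging enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"                 # NCCL's own logging level
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed logging and checks
    return env


def run_with_timeout(cmd, timeout_s, env=None):
    """Run cmd and return its exit code, or None if it did not finish within timeout_s."""
    try:
        return subprocess.run(cmd, env=env, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising this exception.
        return None
```

Calling `run_with_timeout(TUNE_CMD, timeout_s=1800, env=debug_env())` would return `None` if the run is still stuck after 30 minutes (the 1800 is an arbitrary placeholder). Only the launcher process is killed on timeout, so worker processes it spawned may need to be cleaned up separately.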

My config file:

tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /teamspace/studios/this_studio/Gemma-2B/tokenizer.model

dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: yahma/alpaca-cleaned
  template: torchtune.data.AlpacaInstructTemplate
  train_on_input: True
  max_seq_len: 32
  split: train
seed: null
shuffle: False

model:
  _component_: torchtune.models.gemma.lora_gemma_2b
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj']
  apply_lora_to_mlp: True
  lora_rank: 64
  lora_alpha: 16

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /teamspace/studios/this_studio/Gemma-2B/
  checkpoint_files: [
    model-00001-of-00002.safetensors,
    model-00002-of-00002.safetensors,
  ]
  recipe_checkpoint: null
  output_dir: /teamspace/studios/this_studio/Gemma-2B/output/
  model_type: GEMMA
resume_from_checkpoint: False

loss:
  _component_: torch.nn.CrossEntropyLoss

optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 2e-5

lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 1

batch_size: 2
epochs: 3
max_steps_per_epoch: 4
gradient_accumulation_steps: 1

device: cuda
enable_activation_checkpointing: True
dtype: bf16

metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
output_dir: /teamspace/studios/this_studio/Gemma-2B/alpaca-gemma-lora/
log_every_n_steps: 1
log_peak_memory_stats: True
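
Before digging into the recipe itself, one thing worth ruling out is a problem with the paths in the checkpointer section above. The following is a minimal standard-library sketch, under the assumption that a missing input file or an unwritable output directory could contribute to the problem; `check_checkpointer_paths` is a hypothetical helper, not a torchtune function, and the constants simply repeat the values from the config.

```python
import os
import tempfile

# Values copied from the checkpointer section of the config.
CHECKPOINT_DIR = "/teamspace/studios/this_studio/Gemma-2B/"
CHECKPOINT_FILES = [
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]
OUTPUT_DIR = "/teamspace/studios/this_studio/Gemma-2B/output/"


def check_checkpointer_paths(checkpoint_dir, checkpoint_files, output_dir):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for name in checkpoint_files:
        path = os.path.join(checkpoint_dir, name)
        if not os.path.isfile(path):
            problems.append(f"missing checkpoint file: {path}")
    try:
        os.makedirs(output_dir, exist_ok=True)
        # Creating and auto-deleting a temp file shows the directory is writable.
        with tempfile.NamedTemporaryFile(dir=output_dir):
            pass
    except OSError as exc:
        problems.append(f"cannot write to output_dir {output_dir}: {exc}")
    return problems
```

Running `check_checkpointer_paths(CHECKPOINT_DIR, CHECKPOINT_FILES, OUTPUT_DIR)` on the training machine and getting an empty list would rule out these two causes; it says nothing about what happens inside the save step itself.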

wiiiktor • Jul 28 '24 17:07