Gemma not saving checkpoints
I am using Gemma-2B and it is not saving checkpoints at all. It hangs: no error, just waiting forever. I use 4 GPUs, and even though memory usage is very low (5 GB out of the 24 GB available on each GPU), the checkpoints are not saved. I had a similar problem with the Llama 3 8B checkpoint not being saved, but there it was enough to lower the batch size. With Gemma the batch size is already low and I have plenty of memory, yet the checkpoint is still not saved. max_seq_len is 32, so very small. I am on the latest torchtune version, but I had the same issue with the previous one.
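For what it's worth, this is a sketch of how the hang could be inspected while it is stuck (it assumes py-spy is installed; <PID> is a placeholder for the process id of one of the four training ranks):

# dump the Python stack of one hung rank to see what it is waiting on
py-spy dump --pid <PID>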
I use the CLI command:
tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config ./custom_gemma_2B_lora.yaml
My config file:
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /teamspace/studios/this_studio/Gemma-2B/tokenizer.model
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: yahma/alpaca-cleaned
  template: torchtune.data.AlpacaInstructTemplate
  train_on_input: True
  max_seq_len: 32
  split: train
seed: null
shuffle: False
model:
  _component_: torchtune.models.gemma.lora_gemma_2b
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj']
  apply_lora_to_mlp: True
  lora_rank: 64
  lora_alpha: 16
checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /teamspace/studios/this_studio/Gemma-2B/
  checkpoint_files: [
    model-00001-of-00002.safetensors,
    model-00002-of-00002.safetensors,
  ]
  recipe_checkpoint: null
  output_dir: /teamspace/studios/this_studio/Gemma-2B/output/
  model_type: GEMMA
resume_from_checkpoint: False
loss:
  _component_: torch.nn.CrossEntropyLoss
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 2e-5
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 1
batch_size: 2
epochs: 3
max_steps_per_epoch: 4
gradient_accumulation_steps: 1
device: cuda
enable_activation_checkpointing: True
dtype: bf16
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
output_dir: /teamspace/studios/this_studio/Gemma-2B/alpaca-gemma-lora/
log_every_n_steps: 1
log_peak_memory_stats: True
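A quick way to watch whether anything is ever written to the checkpointer's output directory while the job sits there (plain shell; the path is just checkpointer.output_dir from the config above):

# list the checkpoint output directory every 5 seconds during the run
watch -n 5 'ls -lh /teamspace/studios/this_studio/Gemma-2B/output/'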