
Make `save_hyperparameters()` robust against different CLI entry points

Open · awaelchli opened this issue on Mar 13 '24 · 2 comments

If you run with

litgpt finetune ...

then when it gets to saving a checkpoint, we hit this line: https://github.com/Lightning-AI/litgpt/blob/f951f9334610da35c7ecaa7e26e7ba3ac2504dab/litgpt/finetune/lora.py#L193

which re-runs the CLI and re-parses the arguments that were passed. This no longer works, because the parser rebuilt there is not the same parser the litgpt entry point used:

Saving LoRA weights to 'out/finetune/lora-llama2-7b/step-000200/lit_model.pth.lora'
usage: litgpt [-h] [--config CONFIG] [--print_config[=flags]] [--precision PRECISION] [--quantize QUANTIZE] [--devices DEVICES] [--seed SEED] [--lora_r LORA_R]
              [--lora_alpha LORA_ALPHA] [--lora_dropout LORA_DROPOUT] [--lora_query {true,false}] [--lora_key {true,false}] [--lora_value {true,false}]
              [--lora_projection {true,false}] [--lora_mlp {true,false}] [--lora_head {true,false}] [--data.help CLASS_PATH_OR_NAME] [--data DATA]
              [--checkpoint_dir CHECKPOINT_DIR] [--out_dir OUT_DIR] [--logger_name {wandb,tensorboard,csv}] [--train CONFIG] [--train.save_interval SAVE_INTERVAL]
              [--train.log_interval LOG_INTERVAL] [--train.global_batch_size GLOBAL_BATCH_SIZE] [--train.micro_batch_size MICRO_BATCH_SIZE]
              [--train.lr_warmup_steps LR_WARMUP_STEPS] [--train.epochs EPOCHS] [--train.max_tokens MAX_TOKENS] [--train.max_steps MAX_STEPS]
              [--train.max_seq_length MAX_SEQ_LENGTH] [--train.tie_embeddings {true,false,null}] [--train.learning_rate LEARNING_RATE]
              [--train.weight_decay WEIGHT_DECAY] [--train.beta1 BETA1] [--train.beta2 BETA2] [--train.max_norm MAX_NORM] [--train.min_lr MIN_LR] [--eval CONFIG]
              [--eval.interval INTERVAL] [--eval.max_new_tokens MAX_NEW_TOKENS] [--eval.max_iters MAX_ITERS]
error: Unrecognized arguments: finetune lora
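
For context, this is roughly the failure mode, sketched with plain jsonargparse. The setup function and its parameters below are placeholders, not litgpt's actual code: save_hyperparameters() captures the parser that CLI(setup) would build and then re-parses sys.argv, but sys.argv still contains the subcommand tokens consumed by the litgpt console script.

```python
# Minimal sketch of the failure, using plain jsonargparse.
# `setup` is a placeholder, not the real litgpt.finetune.lora.setup signature.
import sys
from jsonargparse import CLI, capture_parser


def setup(devices: int = 1, precision: str = "bf16-true") -> None:
    """Placeholder for a per-command setup() function."""


# Simulate being launched via the `litgpt` console script: argv still contains
# the subcommand words in front of the keyword arguments.
sys.argv = ["litgpt", "finetune", "lora", "--devices", "2"]

# capture_parser() returns the parser that CLI(setup) would have built, i.e. a
# parser that only knows setup()'s own parameters, not the subcommands.
parser = capture_parser(lambda: CLI(setup))

try:
    parser.parse_args(sys.argv[1:])
except SystemExit:
    # With default settings, jsonargparse prints the usage block and
    # "error: Unrecognized arguments: finetune lora", as in the output above.
    pass
```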

An initial hack to fix this was done in #1103; see this comment by @carmocca: https://github.com/Lightning-AI/litgpt/pull/1103#discussion_r1523182612

How do you think this could be done? Do we need to choose between jsonargparse.CLI and the CLI in __main__ depending on how the script was launched, and pass the correct one to capture_parser?

We could also simplify this by not having a CLI in the scripts themselves.

We need to make it more robust.
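
One possible direction, sketched here with made-up names (register_cli_prefix and _CLI_PREFIX are hypothetical, not litgpt's API): have the litgpt entry point record which leading tokens it consumed, so save_hyperparameters() can strip them before re-parsing instead of guessing.

```python
# Sketch only: register_cli_prefix and _CLI_PREFIX are hypothetical names.
# The idea: the console-script entry point records the subcommand tokens it
# consumed so save_hyperparameters() can strip them from sys.argv.
import sys
from pathlib import Path
from typing import Callable, List

from jsonargparse import CLI, capture_parser

_CLI_PREFIX: List[str] = []  # e.g. ["finetune", "lora"], set by the entry point


def register_cli_prefix(tokens: List[str]) -> None:
    """Called once by the `litgpt` entry point after it resolves the subcommand."""
    _CLI_PREFIX[:] = tokens


def save_hyperparameters(function: Callable, checkpoint_dir: Path) -> None:
    """Re-parse the CLI arguments of `function` and save them next to the checkpoint."""
    parser = capture_parser(lambda: CLI(function))
    args = sys.argv[1:]
    if args[: len(_CLI_PREFIX)] == _CLI_PREFIX:
        args = args[len(_CLI_PREFIX):]  # drop "finetune lora" etc. before parsing
    config = parser.parse_args(args)
    parser.save(config, str(checkpoint_dir / "hyperparameters.yaml"), overwrite=True)
```

Alternatively, the entry point that actually parsed the arguments could hand its own parser (or parsed config) to save_hyperparameters(), so the two can never diverge.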

awaelchli · Mar 13 '24

It's not clear how to make this more robust. Perhaps the best way is to drop support for running python litgpt/finetune/lora.py directly, since we no longer advertise it.

carmocca · Mar 14 '24

This is also an issue with

litgpt pretrain -c config_hub/pretrain/tinystories.yaml

...
...
Total parameters: 15,192,288
/graft3/datasets/user/tinystories/TinyStories_all_data already exists, skipping unpacking...
Validating ...
Measured TFLOPs: 3.16
Epoch 1 | iter 80 step 20 | loss train: 10.311, val: n/a | iter time: 298.13 ms (step) remaining time: 1 day, 19:48:16
Saving checkpoint to '/graft3/checkpoints/user/ctt/out/pretrain/stories15M/step-00000020/lit_model.pth'
usage: litgpt [-h] [--config CONFIG] [--print_config[=flags]] [--model_name MODEL_NAME]
              [--model_config MODEL_CONFIG] [--out_dir OUT_DIR]
              [--initial_checkpoint_dir INITIAL_CHECKPOINT_DIR] [--resume RESUME]
              [--data.help CLASS_PATH_OR_NAME] [--data DATA] [--train CONFIG]
              [--train.n_ciphers N_CIPHERS] [--train.save_interval SAVE_INTERVAL]
              [--train.log_interval LOG_INTERVAL] [--train.global_batch_size GLOBAL_BATCH_SIZE]
              [--train.micro_batch_size MICRO_BATCH_SIZE]
              [--train.lr_warmup_steps LR_WARMUP_STEPS] [--train.epochs EPOCHS]
              [--train.max_tokens MAX_TOKENS] [--train.max_steps MAX_STEPS]
              [--train.max_seq_length MAX_SEQ_LENGTH] [--train.tie_embeddings {true,false,null}]
              [--train.learning_rate LEARNING_RATE] [--train.weight_decay WEIGHT_DECAY]
              [--train.beta1 BETA1] [--train.beta2 BETA2] [--train.max_norm MAX_NORM]
              [--train.min_lr MIN_LR] [--eval CONFIG] [--eval.interval INTERVAL]
              [--eval.max_new_tokens MAX_NEW_TOKENS] [--eval.max_iters MAX_ITERS]
              [--devices DEVICES] [--tokenizer_dir TOKENIZER_DIR]
              [--logger_name {wandb,tensorboard,csv}] [--seed SEED]
error: Unrecognized arguments: -c config_hub/pretrain/tinystories.yaml
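
The traceback above hints at a second mismatch on top of the subcommand issue: the litgpt console script accepted -c as a short alias for --config, but the usage printed by the reconstructed parser only lists --config, so replaying sys.argv against it rejects -c. A minimal illustration with plain jsonargparse (not litgpt's code):

```python
# Illustration only: two parsers that accept the same option under different
# spellings. The top-level CLI knows "-c/--config"; the recaptured parser only
# knows "--config", so replaying "-c ..." against it fails.
from jsonargparse import ArgumentParser

top_level = ArgumentParser(prog="litgpt")
top_level.add_argument("-c", "--config", type=str)

recaptured = ArgumentParser(prog="litgpt")
recaptured.add_argument("--config", type=str)

print(top_level.parse_args(["-c", "config_hub/pretrain/tinystories.yaml"]))  # parses fine
try:
    recaptured.parse_args(["-c", "config_hub/pretrain/tinystories.yaml"])
except SystemExit:
    # prints usage and "error: Unrecognized arguments: -c ...", as above
    pass
```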

ivnle · Apr 3 '24