Determine the default precision and quantization in chat and generate
If you finetune a model with a certain quantization and precision setting, you still need to specify them in the `chat` and `generate` commands today:
```bash
litgpt chat \
  --checkpoint_dir out/qlora-codellama-13b/final \
  --precision bf16-true \
  --quantize bnb.nf4-dq
```
Otherwise you may get an OOM error or results that differ from what you saw during training. Since we store the hyperparameters in a YAML file, we could pick up the two settings automatically when they are not specified:
```bash
# uses precision=bf16-true and quantize=bnb.nf4-dq from the checkpoint folder
litgpt chat --checkpoint_dir out/qlora-codellama-13b/final
```
We already do this in other parts of LitGPT, so we could simply reuse the utility function that reads these two settings from the checkpoint.
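For reference, a minimal sketch of what the fallback could look like, assuming the finetuning run wrote a `hyperparameters.yaml` next to the checkpoint (the file name, keys, and `resolve_defaults` helper below are illustrative, not the actual LitGPT utility):

```python
from pathlib import Path
from typing import Optional

import yaml


def resolve_defaults(
    checkpoint_dir: Path,
    precision: Optional[str],
    quantize: Optional[str],
) -> tuple[Optional[str], Optional[str]]:
    """Fill in precision/quantize from the checkpoint's stored hyperparameters.

    Only values the user did not pass explicitly are replaced; if the YAML file
    is missing, the current CLI defaults stay in effect.
    """
    hparams_file = checkpoint_dir / "hyperparameters.yaml"  # assumed file name
    if hparams_file.is_file():
        hparams = yaml.safe_load(hparams_file.read_text())
        if precision is None:
            precision = hparams.get("precision")
        if quantize is None:
            quantize = hparams.get("quantize")
    return precision, quantize
```

`litgpt chat` and `litgpt generate` could call something like this before instantiating the model, so explicit CLI arguments still win over the stored values.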
I don't see how we can tie this decision to the training configuration; the training and inference dtypes can be entirely different.
If the model trains with 16-mixed, what should inference use? And if it trains with 16-true, inference already picks that by default.
For quantization it makes sense, although I'm not sure that we should enable it silently.
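One option, sketched below, would be to apply the stored quantization mode but print a notice so it never happens silently (the helper name and message wording are hypothetical):

```python
from typing import Optional


def apply_quantize_default(
    quantize: Optional[str], stored_quantize: Optional[str]
) -> Optional[str]:
    """Fall back to the checkpoint's quantization mode, but tell the user about it."""
    if quantize is None and stored_quantize is not None:
        print(
            f"Using quantize={stored_quantize!r} from the checkpoint's hyperparameters; "
            "pass --quantize explicitly to override."
        )
        return stored_quantize
    return quantize
```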