Determine the default precision and quantization in chat and generate
If you finetune a model with a certain quantization and precision setting, you still need to specify them in the `chat` and `generate` commands today:
```bash
litgpt chat \
  --checkpoint_dir out/qlora-codellama-13b/final \
  --precision bf16-true \
  --quantize bnb.nf4-dq
```
Otherwise you may get an OOM error or results that differ from what you saw during training. Since we store the hyperparameters in a YAML file, we could pick up the two settings automatically when they are not specified:
```bash
# uses precision=bf16-true and quantize=bnb.nf4-dq from the checkpoint folder
litgpt chat --checkpoint_dir out/qlora-codellama-13b/final
```
We already do this in other parts of LitGPT, so we could simply reuse the utility function that reads these two settings from the checkpoint.
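For reference, a minimal sketch of what the fallback could look like, assuming the finetuning run wrote a `hyperparameters.yaml` next to the checkpoint (the file name, keys, and `resolve_defaults` helper below are illustrative, not the actual LitGPT utility):

```python
from pathlib import Path
from typing import Optional

import yaml


def resolve_defaults(
    checkpoint_dir: Path,
    precision: Optional[str],
    quantize: Optional[str],
) -> tuple[Optional[str], Optional[str]]:
    """Fill in precision/quantize from the checkpoint's stored hyperparameters.

    Only values the user did not pass explicitly are replaced; if the YAML file
    is missing, the current CLI defaults stay in effect.
    """
    hparams_file = checkpoint_dir / "hyperparameters.yaml"  # assumed file name
    if hparams_file.is_file():
        hparams = yaml.safe_load(hparams_file.read_text())
        if precision is None:
            precision = hparams.get("precision")
        if quantize is None:
            quantize = hparams.get("quantize")
    return precision, quantize
```

`litgpt chat` and `litgpt generate` could call something like this before instantiating the model, so explicit CLI arguments still win over the stored values.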
I don't see how we can tie this decision to the training configuration; the training and inference dtypes can be entirely different.
If the model trains with 16-mixed, what should inference use? And if it trains with 16-true, inference already picks that by default.
For quantization it makes sense, although I'm not sure that we should enable it silently.
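One option, sketched below, would be to apply the stored quantization mode but print a notice so it never happens silently (the helper name and message wording are hypothetical):

```python
from typing import Optional


def apply_quantize_default(
    quantize: Optional[str], stored_quantize: Optional[str]
) -> Optional[str]:
    """Fall back to the checkpoint's quantization mode, but tell the user about it."""
    if quantize is None and stored_quantize is not None:
        print(
            f"Using quantize={stored_quantize!r} from the checkpoint's hyperparameters; "
            "pass --quantize explicitly to override."
        )
        return stored_quantize
    return quantize
```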