
[Bug] Test set is taken from training set

Open Peter-Devine opened this issue 1 year ago • 1 comments

Hey, first of all, thanks for the great repo. This is great work.

I think I have found a bug regarding how the data is split in qlora.py

When you load a dataset from a local file, the dataset is automatically split into train:test splits at a 90:10 ratio. This results in a datasets dict with the keys "train" and "test".

However, the code later only checks whether a dataset has an "eval" key rather than a "test" key in the dict.

If the dataset doesn't have an "eval" key, then an evaluation set is generated from the "train" set, while the full "train" set is still used for training the model. This would mean that we are evaluating on training data, which is not a good way to test whether the model has overfit.

A small change would account for this, by adding a similar check for a "test" key after the check for an "eval" key.
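A minimal sketch of what that check could look like, assuming the loaded object is dict-like (as a `datasets.DatasetDict` is) and that the fallback uses `Dataset.train_test_split`. The function name `select_train_and_eval` and its parameters are hypothetical and not taken from qlora.py.

```python
def select_train_and_eval(dataset, eval_dataset_size=0.1, seed=42):
    """Return (train_split, eval_split) with no overlap between them.

    `dataset` is any dict-like mapping of split names to datasets,
    for example a `datasets.DatasetDict`. Hypothetical helper for illustration.
    """
    # Prefer an explicit held-out split. "test" is the key produced when a
    # local file is split automatically into train/test.
    for key in ("eval", "test"):
        if key in dataset:
            return dataset["train"], dataset[key]

    # No held-out split exists: carve one out of "train" and keep only the
    # remainder for training, so evaluation examples are never trained on.
    split = dataset["train"].train_test_split(
        test_size=eval_dataset_size, shuffle=True, seed=seed
    )
    return split["train"], split["test"]
```

Returning both splits from one place makes it harder to end up training on the full "train" set while evaluating on a slice of it.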

Thanks!

Peter-Devine, Aug 22 '23 06:08

PS, being able to change the ratio (or number of examples) of the test set when it is automatically generated would be a nice touch, i.e., instead of setting test_size as a constant with test_size=0.1, making it an input variable such as test_size=args.local_test_set_size. Thanks again!
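A rough sketch of how such an option could be wired up, assuming the local file has already been loaded into a `datasets.Dataset`. The helpers `parse_test_size`, `build_parser`, and `split_local_dataset` are hypothetical, and `argparse` is used only to keep the example standalone; the script's actual argument handling may differ.

```python
import argparse


def parse_test_size(value):
    """Interpret "0.2" as a fraction and "500" as an absolute example count."""
    number = float(value)
    if 0.0 < number < 1.0:
        return number
    if number >= 1 and number.is_integer():
        return int(number)
    raise argparse.ArgumentTypeError(
        f"expected a fraction in (0, 1) or a positive integer, got {value!r}"
    )


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_test_set_size",  # hypothetical flag name, from the suggestion above
        type=parse_test_size,
        default=0.1,  # keeps the current 90:10 behaviour when the flag is omitted
        help="Fraction (0-1) or number of examples held out from a local file.",
    )
    return parser


def split_local_dataset(full_dataset, test_size):
    # `full_dataset` is assumed to be a `datasets.Dataset`; its
    # `train_test_split` accepts either a float fraction or an int count.
    return full_dataset.train_test_split(test_size=test_size)
```

For example, `build_parser().parse_args(["--local_test_set_size", "500"])` would hold out 500 examples, while omitting the flag keeps the 10% default.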

Peter-Devine, Aug 22 '23 06:08