[Bug] Test set is taken from training set
Hey, first of all, thanks for the great repo. This is great work.
I think I have found a bug in how the data is split in qlora.py.
When you load a dataset from a local file, the dataset is automatically split into train and test splits at a 90:10 ratio. This results in a datasets dict with the keys "train" and "test".
However, the code later only checks whether a dataset has an "eval" key rather than a "test" key in the dict.
If the dataset doesn't have an "eval" key, then a test set is generated from the "train" set, while the full "train" set is still used for training the model. This would mean that we are evaluating on training data, which is not a good way to test whether the model has overfit.
A small change would account for this: add a similar check for a "test" key after the check for an "eval" key.
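Roughly, the check I have in mind could look like the sketch below. This is not the actual code from qlora.py: it uses plain dicts and lists instead of a datasets dict, and `select_train_eval`, `eval_size`, and `seed` are hypothetical names I made up for illustration. The point is only that an existing "eval" or "test" split is reused, and that a freshly generated evaluation set is removed from the training data.

```python
import random


def select_train_eval(dataset, eval_size=0.1, seed=42):
    """Return (train, eval) from a mapping of split name -> list of examples.

    Hypothetical helper for illustration; not part of qlora.py.
    """
    # Prefer an existing held-out split, checking "eval" first and then "test".
    for key in ("eval", "test"):
        if key in dataset:
            return dataset["train"], dataset[key]

    # No held-out split exists: carve one out of "train" and make sure
    # the held-out examples are removed from the training data.
    examples = list(dataset["train"])
    random.Random(seed).shuffle(examples)
    n_eval = max(1, round(len(examples) * eval_size))
    return examples[n_eval:], examples[:n_eval]


# A locally loaded dataset that was already split 90:10 into "train" and "test".
splits = {"train": list(range(90)), "test": list(range(90, 100))}
train_data, eval_data = select_train_eval(splits)
print(len(train_data), len(eval_data))
```

With this ordering, a local dataset that already carries a "test" split is evaluated on that split and is never re-split from "train".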
Thanks!
PS, being able to change the ratio (or number of examples) of the test set when it is automatically generated would be a nice touch (i.e., instead of setting test_size as a constant with test_size=0.1, it would be an input variable, test_size=args.local_test_set_size). Thanks again!
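
To make the suggestion concrete, here is a minimal sketch of a configurable test size. It assumes a plain argparse flag named after the `args.local_test_set_size` variable above; the repo may define its arguments differently, and `build_parser` and `split_local` are hypothetical stand-ins that operate on a plain list rather than on the real dataset object. In the actual code the value would presumably just be passed as `test_size` to the existing split call.

```python
import argparse
import random


def build_parser():
    # Hypothetical flag; the name mirrors args.local_test_set_size.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_test_set_size",
        type=float,
        default=0.1,
        help="Fraction (below 1) or number of examples (1 or more) held out as the test set.",
    )
    return parser


def split_local(examples, test_size, seed=42):
    """Split examples into train/test; test_size is a fraction or a count."""
    examples = list(examples)
    if test_size >= 1:
        n_test = int(test_size)  # interpreted as a number of examples
    else:
        n_test = round(len(examples) * test_size)  # interpreted as a fraction
    if not 0 < n_test < len(examples):
        raise ValueError("test_size must leave both splits non-empty")
    random.Random(seed).shuffle(examples)
    return {"train": examples[n_test:], "test": examples[:n_test]}


args = build_parser().parse_args(["--local_test_set_size", "0.25"])
splits = split_local(range(100), args.local_test_set_size)
print(len(splits["train"]), len(splits["test"]))
```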