[FEATURE REQUEST] Add stratification on train/validation split with fine_tune.prepare_data

Open fratambot opened this issue 2 years ago • 1 comments

Hello, first of all, many thanks for this great library! 🙏

When I prepare data for multiclass classification fine-tuning and accept the suggested split into train and validation data, both resulting datasets end up with a different number of classes than the number I specified. Error message:

[2022-03-25 10:12:57] Fine-tune failed. Errors:
The number of classes in file-LSGG6mb4lhNMqyAxN6dA63sc does not match the number of classes specified in the hyperparameters.
The number of classes in file-tRE2P9nw9pq2NtM4qpKgceI2 does not match the number of classes specified in the hyperparameters.

This looks to me like a stratification problem in the split. Do you think it would be possible to include a stratification option in the future? I know it's not an easy task: when you don't have many examples, you have to adjust test_size manually until both splits contain the same number of classes. It could be automated, though, by progressively increasing test_size until train_dataset.nunique() == test_dataset.nunique().
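To illustrate what I mean, here is a minimal sketch of a stratified split written with the standard library only. It is not how fine_tune.prepare_data works today; `stratified_split`, `label_key`, and the toy examples are hypothetical names I made up for illustration, and I am assuming each example is a dict whose class label sits in the `completion` field.

```python
import random
from collections import defaultdict


def stratified_split(examples, label_key="completion", test_size=0.2, seed=0):
    """Split examples so that every class appears in both train and validation.

    Hypothetical helper for illustration; not part of openai-python.
    """
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for example in examples:
        by_class[example[label_key]].append(example)

    train, valid = [], []
    for label, group in by_class.items():
        if len(group) < 2:
            raise ValueError(
                f"class {label!r} needs at least 2 examples to appear in both splits"
            )
        rng.shuffle(group)
        # Send roughly test_size of each class to validation, but always keep
        # at least one example of the class on each side of the split.
        n_valid = min(max(1, round(len(group) * test_size)), len(group) - 1)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])

    rng.shuffle(train)
    rng.shuffle(valid)
    return train, valid


# Toy, imbalanced dataset (assumed data, only to exercise the function).
labels = ["a"] * 10 + ["b"] * 5 + ["c"] * 2
examples = [{"prompt": f"text {i}", "completion": label} for i, label in enumerate(labels)]

train, valid = stratified_split(examples, test_size=0.2)
train_classes = {e["completion"] for e in train}
valid_classes = {e["completion"] for e in valid}
print(len(train), len(valid), train_classes == valid_classes)
```

Splitting per class avoids the trial-and-error loop over test_size, at the cost of requiring at least two examples for every class; a class with a single example cannot be present in both files.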

fratambot · Mar 25 '22 09:03