openai-python
[FEATURE REQUEST] Add stratification on train/validation split with fine_tune.prepare_data
Hello, first of all many thanks for this great library! 🙏
When preparing data for multiclass classification fine-tuning and accepting the split into train and validation data, I end up with a number of classes in each dataset that differs from the number I specified. Error message:
[2022-03-25 10:12:57] Fine-tune failed. Errors:
The number of classes in file-LSGG6mb4lhNMqyAxN6dA63sc does not match the number of classes specified in the hyperparameters.
The number of classes in file-tRE2P9nw9pq2NtM4qpKgceI2 does not match the number of classes specified in the hyperparameters.
It seems to me to be a problem related to stratification while splitting. Do you think it'd be possible to include this option in the future? I know it's not an easy task, and when you don't have many examples you have to manually play with test_size until you get the same number of classes in the splits, but it could be automated by progressively increasing test_size until train_dataset.nunique() == test_dataset.nunique().
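In case it helps anyone hitting the same error, here is a rough sketch of the kind of stratified split I have in mind. It is not how fine_tune.prepare_data works today. The function name stratified_split, the label_key parameter, and the assumption that each example is a dict whose "completion" field holds the class label are all mine, for illustration only. Rather than growing test_size until the class counts match, it splits each class separately so every class lands in both files.

```python
import random
from collections import defaultdict


def stratified_split(examples, label_key="completion", test_size=0.2, seed=0):
    """Split examples so that every class appears in both train and validation.

    Hypothetical helper: `examples` is assumed to be a list of dicts and
    `label_key` the field that holds the class label.
    """
    # Group the examples by class label.
    by_class = defaultdict(list)
    for example in examples:
        by_class[example[label_key]].append(example)

    rng = random.Random(seed)
    train, valid = [], []
    for label, group in by_class.items():
        if len(group) < 2:
            raise ValueError(
                f"class {label!r} needs at least 2 examples to appear in both splits"
            )
        rng.shuffle(group)
        # Keep at least one example per class in validation
        # and at least one per class in training.
        n_valid = min(max(1, round(len(group) * test_size)), len(group) - 1)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid


# Toy data with an imbalanced class distribution (10 / 5 / 2 examples).
labels = ["a"] * 10 + ["b"] * 5 + ["c"] * 2
examples = [{"prompt": f"text {i}", "completion": label} for i, label in enumerate(labels)]
train, valid = stratified_split(examples, test_size=0.2)
```

With this approach the rarest class still contributes one example to each split, so both files contain the same set of classes. A class with a single example cannot be placed in both files, which is why the sketch raises an error in that case instead of silently dropping it.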
Thanks for writing in @fratambot! I'll pass this along to the fine-tuning team.
This does not look like a bug in the SDK, so I'm going to go ahead and close this issue. If it's still relevant, I encourage you to repost at community.openai.com.