multifit
Specifying a validation set
I'm training a language model similar to the one shown in https://github.com/n-waves/multifit/blob/master/notebooks/CLS-JA.ipynb
While running cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch() I get the following output:
Running tokenization: 'lm-notst' ...
Validation set not found using 10% of trn
Data lm-notst, trn: 26925, val: 2991
Size of vocabulary: 15000
First 20 words in vocab: ['xxunk', 'xxpad', 'xxbos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', '\n', '▁', '▁,', '▁.', '▁в', 'а', 'и', 'е', '▁и', 'й', '▁на', 'х']
Running tokenization: 'cls' ...
Data cls, trn: 26925, val: 2991
Running tokenization: 'tst' ...
/home/explorer/miniconda3/envs/fast/lib/python3.6/site-packages/fastai/data_block.py:537: UserWarning: You are labelling your items with CategoryList. Your valid set contained the following unknown labels, the corresponding items have been discarded. 201, 119, 192, 162, 168...
  if getattr(ds, 'warn', False): warn(ds.warn)
Data tst, trn: 2991, val: 7448
I assume this is a problem with the labels of the automatically inferred validation set being misinterpreted. Is there a way to pass a validation set explicitly?
Name your files train.csv, dev.csv, test.csv and unsup.csv, or look at the from_df options.
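For example, you could create dev.csv yourself by holding out part of your training data before tokenization. Below is a minimal sketch using pandas and scikit-learn; the dataset directory path and the header-less "label, text" column layout are assumptions based on the CLS-style CSVs used in the linked notebook, so adjust them to match your data.

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

data_dir = Path("data/cls/ja-books")  # assumed dataset directory; change to your own

# Assumed layout: no header row, label in column 0, text in column 1.
full_train = pd.read_csv(data_dir / "train.csv", header=None)

# Hold out 10% as an explicit validation set, stratified by label so that
# every class in dev.csv also appears in train.csv (avoiding the
# "unknown labels ... discarded" warning above).
train_df, dev_df = train_test_split(
    full_train, test_size=0.1, stratify=full_train[0], random_state=42
)

train_df.to_csv(data_dir / "train.csv", header=False, index=False)
dev_df.to_csv(data_dir / "dev.csv", header=False, index=False)
```

With train.csv and dev.csv both present in the dataset folder, the "Validation set not found" fallback should no longer trigger and your own split will be used instead.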