
Specifying a validation set

Open FOX111 opened this issue 4 years ago • 1 comments

I'm training a language model similar to the one shown here: https://github.com/n-waves/multifit/blob/master/notebooks/CLS-JA.ipynb

When I run cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch(), I get this output:

Running tokenization: 'lm-notst' ...
Validation set not found using 10% of trn
Data lm-notst, trn: 26925, val: 2991
Size of vocabulary: 15000
First 20 words in vocab: ['xxunk', 'xxpad', 'xxbos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', '', '▁', '▁,', '▁.', '▁в', 'а', 'и', 'е', '▁и', 'й', '▁на', 'х']
Running tokenization: 'cls' ...
Data cls, trn: 26925, val: 2991
Running tokenization: 'tst' ...
/home/explorer/miniconda3/envs/fast/lib/python3.6/site-packages/fastai/data_block.py:537: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
201, 119, 192, 162, 168...
  if getattr(ds, 'warn', False): warn(ds.warn)
Data tst, trn: 2991, val: 7448

I assume the problem is that labels are poorly represented in the validation set that was inferred automatically. Is there a way to pass a validation set explicitly?

FOX111, Apr 16 '20 11:04

Name your files train.csv, dev.csv, test.csv and unsup.csv, or read up on the from_df options.

Qe42, Jun 22 '20 15:06
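
For anyone landing here with the same warning: one way to produce the train.csv and dev.csv files mentioned above is to split the labelled data per label, so that every label in the validation file also occurs in the training file. The following is only a minimal sketch using the Python standard library. It assumes header-less CSV rows with the label in the first column, which may not match your data; the names stratified_split and write_csv are hypothetical helpers, not part of multifit or fastai.

```python
import csv
import random
from collections import defaultdict


def stratified_split(rows, label_col=0, dev_frac=0.1, seed=42):
    """Split rows into (train, dev) so every label in dev also occurs in train.

    Assumption: each row is a list of strings and row[label_col] is the label.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_col]].append(row)

    rng = random.Random(seed)
    train, dev = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Always keep at least one example of each label in train,
        # so a label with a single example never ends up only in dev.
        n_dev = min(int(round(len(group) * dev_frac)), len(group) - 1)
        dev.extend(group[:n_dev])
        train.extend(group[n_dev:])
    return train, dev


def write_csv(path, rows):
    """Write rows to path as a header-less CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

After reading your labelled rows with csv.reader, call stratified_split on them and write the two results with write_csv to train.csv and dev.csv in the dataset folder. Whether the loader then picks up dev.csv instead of sampling 10% of the training data is based on the reply above, so check it against the "Validation set not found" message in your own output.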