flair
[Bug]: corpus.make_label_dictionary generates too many tags
Describe the bug
I have only 38 tags in my POS corpus, but corpus.make_label_dictionary returns a dictionary of 400 tags.
To Reproduce
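from flair.data import Corpus
from flair.datasets import ColumnCorpus

# Column 0 holds the token, column 1 holds the POS tag
columns = {0: 'text', 1: 'pos'}
corpus: Corpus = ColumnCorpus('dataset/flair/', columns, train_file='train.txt', test_file='test.txt', dev_file='dev.txt')
label_dict = corpus.make_label_dictionary(label_type='pos', add_unk=True)
print(label_dict)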
Expected behavior
I expect the length of label_dict to be 38 tags.
Logs and Stack traces
2023-08-07 22:04:49,436 Dictionary created for label 'pos' with 249 values: IN (seen 2771 times), JJ (seen 1938 times), NN.C.1.M (seen 1831 times), NN.C.2 (seen 1303 times), NN.C.1.F (seen 1211 times), CC (seen 1152 times), PT (seen 1140 times), RB (seen 914 times), NN.P (seen 846 times), DT (seen 666 times), VB.DX (seen 515 times), VB.PC (seen 432 times), NB (seen 365 times), VB.P (seen 342 times), VB.D (seen 332 times), PU (seen 314 times), PR.C (seen 248 times), VB.H (seen 227 times), VB.DC (seen 153 times), NG (seen 147 times)
Dictionary with 249 tags: <unk>, IN, JJ, NN.C.1.M, NN.C.2, NN.C.1.F, CC, PT, RB, NN.P, DT, VB.DX, VB.PC, NB, VB.P, VB.D, PU, PR.C, VB.H, VB.DC, NG, VB.G, VB.INF, BA, PR.P.iii, RP, VB.PX, PR.P.i, FX, VB.IMP, PR.P.ii, PR.P$, PR.W, VB.N, PR.DIS, FW, امله, ویروس, خان, شمېر, چارو, کبله, جام, مخې, څه, اباد, ورځ, ملتونو, ډګر, عربستان
Screenshots
No response
Additional Context
No response
Environment
flair = 0.12.2
torch = 2.0.1
transformers = 4.31.0
Python = 3.9.17
Hi @ijazul-haq, please note that because you are using a custom private dataset, we cannot tell what is going wrong. You can debug this issue by:
- Identifying which tags you were not expecting (I suppose the Arabic-script words?)
- Finding a sentence that has such a tag
- Locating the respective lines in the dataset
- Verifying that the format is right and, if not, correcting it
- If the format is right, constructing a similar example that reproduces the error and sharing it here
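
The debugging steps above can be sketched as a small standalone script that does not need flair at all. This is only a rough sketch under some assumptions: the data files are whitespace-separated with the token in column 0 and the tag in column 1 (matching the `columns` mapping in the report), blank lines separate sentences, and the set of expected tags is known. The helper name `find_unexpected_tags` and the sample rows are hypothetical, made up for illustration. One possible cause it would surface is a token that itself contains a space, which shifts a word into the tag column.

```python
import re


def find_unexpected_tags(lines, expected_tags, tag_column=1, delimiter=r"\s+"):
    """Return (line_number, line, tag) for every row whose tag is not expected.

    `tag` is None when the row has too few columns.
    """
    problems = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:  # blank lines separate sentences, nothing to check
            continue
        fields = re.split(delimiter, line)
        tag = fields[tag_column] if len(fields) > tag_column else None
        if tag not in expected_tags:
            problems.append((line_number, line, tag))
    return problems


# Hypothetical sample rows, not taken from the reporter's dataset.
sample = [
    "کتاب NN.C.1.M",
    "جام جم NN.P",  # token contains a space, so column 1 is a word
    "",
    "ښه JJ",
]
expected = {"NN.C.1.M", "NN.P", "JJ"}

problems = find_unexpected_tags(sample, expected)
for line_number, line, tag in problems:
    print(line_number, repr(tag), line)
```

To run it against a real file, pass an open file handle instead of the sample list, for example `open('dataset/flair/train.txt', encoding='utf-8')`, and fill `expected` with the 38 tags you intend to use. The reported line numbers then point at the rows to inspect or correct.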