
[Bug]: corpus.make_label_dictionary generates too many tags

Open ijazul-haq opened this issue 1 year ago • 2 comments

Describe the bug

I have only 38 tags in my POS corpus, but corpus.make_label_dictionary returns a dictionary of 400 tags.

To Reproduce

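from flair.data import Corpus
from flair.datasets import ColumnCorpus

# Column 0 holds the token, column 1 holds the POS tag
columns = {0: 'text', 1: 'pos'}
corpus: Corpus = ColumnCorpus('dataset/flair/', columns, train_file='train.txt', test_file='test.txt', dev_file='dev.txt')

label_dict = corpus.make_label_dictionary(label_type='pos', add_unk=True)
print(label_dict)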

Expected behavior

I expect the length of label_dict to be 38 tags.

Logs and Stack traces

2023-08-07 22:04:49,436 Dictionary created for label 'pos' with 249 values: IN (seen 2771 times), JJ (seen 1938 times), NN.C.1.M (seen 1831 times), NN.C.2 (seen 1303 times), NN.C.1.F (seen 1211 times), CC (seen 1152 times), PT (seen 1140 times), RB (seen 914 times), NN.P (seen 846 times), DT (seen 666 times), VB.DX (seen 515 times), VB.PC (seen 432 times), NB (seen 365 times), VB.P (seen 342 times), VB.D (seen 332 times), PU (seen 314 times), PR.C (seen 248 times), VB.H (seen 227 times), VB.DC (seen 153 times), NG (seen 147 times)
Dictionary with 249 tags: <unk>, IN, JJ, NN.C.1.M, NN.C.2, NN.C.1.F, CC, PT, RB, NN.P, DT, VB.DX, VB.PC, NB, VB.P, VB.D, PU, PR.C, VB.H, VB.DC, NG, VB.G, VB.INF, BA, PR.P.iii, RP, VB.PX, PR.P.i, FX, VB.IMP, PR.P.ii, PR.P$, PR.W, VB.N, PR.DIS, FW, امله, ویروس, خان, شمېر, چارو, کبله, جام, مخې, څه, اباد, ورځ, ملتونو, ډګر, عربستان

Screenshots

No response

Additional Context

No response

Environment

flair = 0.12.2
torch = 2.0.1
Python =
transformers = 4.31.0

ijazul-haq • Aug 07 '23 14:08

Python = 3.9.17

ijazul-haq • Aug 07 '23 14:08

Hi @ijazul-haq, please note that since you are using a custom private dataset, we cannot judge what is not working. You can debug this issue by:

  • Identifying which tags you were not expecting (I suppose the Arabic words?)
  • Filtering out a sentence that has such a tag
  • Finding the respective lines in the dataset
  • Verifying that the format is right and, if not, correcting it
  • If the format is right, constructing a similar example that reproduces the error and sharing it here
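The first few steps can be sketched as a plain file scan that does not need flair. This is only a rough sketch under assumptions: the helper name `find_suspicious_lines`, the sample lines and the small tag set are made up for illustration, and it assumes the file has two whitespace-separated columns (token, then tag), matching the `columns` mapping in the report. One possible cause it would surface is a token that itself contains whitespace, which shifts part of the token into the tag column.

```python
import re
from typing import Iterable, List, Set, Tuple


def find_suspicious_lines(
    lines: Iterable[str],
    expected_tags: Set[str],
    n_columns: int = 2,
    tag_column: int = 1,
) -> List[Tuple[int, str, str]]:
    """Return (line number, reason, line) for lines that look malformed.

    Hypothetical helper: assumes whitespace-separated columns and that
    blank lines separate sentences.
    """
    problems = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank line: sentence boundary, nothing to check
        fields = re.split(r"\s+", line)
        if len(fields) != n_columns:
            problems.append((line_no, "wrong column count", line))
        elif fields[tag_column] not in expected_tags:
            problems.append((line_no, "unexpected tag", line))
    return problems


# Made-up sample data; replace with the real file, for example:
#   with open("dataset/flair/train.txt", encoding="utf-8") as f:
#       problems = find_suspicious_lines(f, expected_tags)
sample = ["word1 NN.P", "two words NN.P", "", "word3 XX"]
expected_tags = {"NN.P", "JJ"}  # in practice, the 38 tags you expect

problems = find_suspicious_lines(sample, expected_tags)
for line_no, reason, line in problems:
    print(f"line {line_no}: {reason}: {line!r}")
```

Any line it reports is a candidate for the format check in the steps above. Lines with a wrong column count are the most likely source of words ending up as tags.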

helpmefindaname • Aug 14 '23 11:08