MatchZoo-py

Question about the WikiQA dataset

Open RenShuhuai-Andy opened this issue 4 years ago • 0 comments

I did some analysis of the WikiQA dataset:

  • training set: Left num: 2118; Right num: 18841; Relation num: 20360; positive examples (label 1): 1040 (5.1%)
  • dev set: Left num: 296; Right num: 2708; Relation num: 2733; positive examples: 140 (5.12%)
  • test set: Left num: 633; Right num: 5961; Relation num: 6165; positive examples: 293 (4.75%)

Is this the official way to pair questions and answers? Positive examples make up only about 5% of each of the three sets, which means a model that always outputs 0 would reach about 95% accuracy. The best performance of BERT on this dataset is also just 95%. Isn't the ratio of positive to negative examples too imbalanced?
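To make the arithmetic behind this concern concrete, here is a minimal sketch that recomputes the positive rate and the accuracy of a constant "always 0" predictor from the counts listed above. It uses only those counts, not the MatchZoo-py API; the names `splits`, `positive_rate`, and `all_zero_accuracy` are illustrative and not part of any library.

```python
# Relation and positive-example counts copied from the statistics above.
splits = {
    "train": {"relations": 20360, "positives": 1040},
    "dev": {"relations": 2733, "positives": 140},
    "test": {"relations": 6165, "positives": 293},
}


def positive_rate(relations, positives):
    """Fraction of relations labeled 1."""
    return positives / relations


def all_zero_accuracy(relations, positives):
    """Accuracy of a predictor that outputs 0 for every relation.

    Such a predictor is correct on every negative and wrong on every positive.
    """
    return (relations - positives) / relations


for name, counts in splits.items():
    rate = positive_rate(**counts)
    acc = all_zero_accuracy(**counts)
    print(f"{name}: positive rate {rate:.2%}, all-zero accuracy {acc:.2%}")
```

On these counts the constant predictor lands at roughly 95% accuracy on every split, which is why plain accuracy says little about model quality here.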

RenShuhuai-Andy, Jan 17 '20 09:01