SubjQA
SubjQA wrong boolean values in entries
As reported by @arnaudstiegler (https://github.com/huggingface/datasets/issues/2503), there appear to be mismatches between some of the fields in the SubjQA dataset. More concretely, the boolean is_ques_subjective does not seem to match the corresponding question_subj_level.
As an example, file books/splits/train.csv
contains the row:
0002007770,books,interesting,matter,fascinating,part,0255768496a256c5ed7caed9d4e47e4c,a907837bafe847039c8da374a144bff9,What are the parts like?,2,0.0,False,a7f1a2503eac2580a0ebbc1d24fffca1,"While I would not recommend this book to a young reader due to a couple pretty explicate scenes I would recommend it to any adult who just loves a good book. Once I started reading it I could not put it down. I hesitated reading it because I didn't think that the subject matter would be interesting, but I was so wrong. This is a wonderfully written book. ANSWERNOTFOUND",This is a wonderfully written book,"(324, 358)",2,1.0,True
where:
- question_subj_level = 2
- is_ques_subjective = False

whereas is_ques_subjective should be True, because question_subj_level is below 4.
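The inconsistency above can be quantified across a split. The following is a minimal sketch using pandas with inline sample rows and only the two relevant columns (the real train.csv has many more fields, so the column subset here is illustrative, not the file's actual header):

```python
import pandas as pd
from io import StringIO

# Illustrative sample rows; in practice, read books/splits/train.csv instead.
csv = StringIO(
    "question_subj_level,is_ques_subjective\n"
    "2,False\n"   # documented rule says level < 4 => subjective, so this looks wrong
    "5,False\n"
    "1,True\n"
)
df = pd.read_csv(csv)

# Apply the rule as documented in the dataset card: levels below 4 are subjective.
expected = df["question_subj_level"] < 4
mismatches = df[expected != df["is_ques_subjective"]]
print(len(mismatches))  # prints 1
```

Running the same check over the full CSV would show how widespread the mismatch is.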
Issue reported by @arnaudstiegler:
SubjQA seems to have a boolean that's consistently wrong.
It defines:
question_subj_level: The subjectivity level of the question (on a 1 to 5 scale with 1 being the most subjective).
is_ques_subjective: A boolean subjectivity label derived from question_subj_level (i.e., scores below 4 are considered subjective).
However, is_ques_subjective seems to have wrong values in the entire dataset.
For instance, in the example in the dataset card, we have:
"question_subj_level": 2 "is_ques_subjective": false However, according to the description, the question should be subjective since the question_subj_level is below 4
Thank you for pointing this out!
I did look into this carefully, and ...
- All numerical values (i.e., subjectivity ratings & TextBlob scores) were accurate, and there are no mistakes there.
- There is a documentation error regarding the "is_ques_subjective" and "is_ans_subjective" columns. These columns (which represent a boolean version of subjectivity) were not derived from the subjectivity ratings reported by annotators. Instead, they were derived from the TextBlob subjectivity scores (any score above 0.5 is considered subjective).
I'll perform one final check in the next couple of days, and update the Readme accordingly to fix the issue and avoid further confusion.
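Under that explanation, the boolean labels would follow a simple threshold on the TextBlob score rather than the 1-to-5 annotator rating. A minimal sketch of that rule (the function name is hypothetical; the 0.5 threshold is the one stated above):

```python
def derive_subjectivity_label(textblob_score: float, threshold: float = 0.5) -> bool:
    # Boolean label derived from the TextBlob subjectivity score,
    # not from the 1-to-5 annotator rating: scores above 0.5 count as subjective.
    return textblob_score > threshold

# The example row above has a question TextBlob score of 0.0 and an answer
# score of 1.0, which matches is_ques_subjective=False / is_ans_subjective=True.
print(derive_subjectivity_label(0.0), derive_subjectivity_label(1.0))  # prints: False True
```

This is consistent with the example row quoted earlier, where question_subj_level = 2 but is_ques_subjective = False: the label tracks the 0.0 TextBlob score, not the rating.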