SubjQA
SubjQA wrong boolean values in entries
As reported by @arnaudstiegler (https://github.com/huggingface/datasets/issues/2503), there appear to be mismatches between some of the fields in the SubjQA dataset. More concretely, the boolean is_ques_subjective does not seem to match the corresponding question_subj_level.
As an example, file books/splits/train.csv
contains the row:
0002007770,books,interesting,matter,fascinating,part,0255768496a256c5ed7caed9d4e47e4c,a907837bafe847039c8da374a144bff9,What are the parts like?,2,0.0,False,a7f1a2503eac2580a0ebbc1d24fffca1,"While I would not recommend this book to a young reader due to a couple pretty explicate scenes I would recommend it to any adult who just loves a good book. Once I started reading it I could not put it down. I hesitated reading it because I didn't think that the subject matter would be interesting, but I was so wrong. This is a wonderfully written book. ANSWERNOTFOUND",This is a wonderfully written book,"(324, 358)",2,1.0,True
where:
- question_subj_level = 2
- is_ques_subjective = False

whereas is_ques_subjective should be True, because question_subj_level is below 4.
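The inconsistency above can be quantified across a split. The following is a minimal sketch using pandas with inline sample rows and only the two relevant columns (the real train.csv has many more fields, so the column subset here is illustrative, not the file's actual header):

```python
import pandas as pd
from io import StringIO

# Illustrative sample rows; in practice, read books/splits/train.csv instead.
csv = StringIO(
    "question_subj_level,is_ques_subjective\n"
    "2,False\n"   # documented rule says level < 4 => subjective, so this looks wrong
    "5,False\n"
    "1,True\n"
)
df = pd.read_csv(csv)

# Apply the rule as documented in the dataset card: levels below 4 are subjective.
expected = df["question_subj_level"] < 4
mismatches = df[expected != df["is_ques_subjective"]]
print(len(mismatches))  # prints 1
```

Running the same check over the full CSV would show how widespread the mismatch is.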
Issue reported by @arnaudstiegler:
SubjQA seems to have a boolean that's consistently wrong.
It defines:
question_subj_level: The subjectivity level of the question (on a 1 to 5 scale with 1 being the most subjective).
is_ques_subjective: A boolean subjectivity label derived from question_subj_level (i.e., scores below 4 are considered subjective).
However, is_ques_subjective seems to have wrong values in the entire dataset.
For instance, in the example in the dataset card, we have:
"question_subj_level": 2 "is_ques_subjective": false However, according to the description, the question should be subjective since the question_subj_level is below 4
Thank you for pointing this out!
I did look into this carefully, and ...
- All numerical values (i.e., subjectivity ratings & TextBlob scores) were accurate, and there are no mistakes there.
- There is a documentation error regarding the "is_ques_subjective" and "is_ans_subjective" columns. These columns (which represent a boolean version of subjectivity) were not derived from the subjectivity ratings reported by annotators. Instead, they were derived from the TextBlob subjectivity scores (any score above 0.5 is considered subjective).
I'll perform one final check in the next couple of days, and update the Readme accordingly to fix the issue and avoid further confusion.
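Under that explanation, the boolean labels would follow a simple threshold on the TextBlob score rather than the 1-to-5 annotator rating. A minimal sketch of that rule (the function name is hypothetical; the 0.5 threshold is the one stated above):

```python
def derive_subjectivity_label(textblob_score: float, threshold: float = 0.5) -> bool:
    # Boolean label derived from the TextBlob subjectivity score,
    # not from the 1-to-5 annotator rating: scores above 0.5 count as subjective.
    return textblob_score > threshold

# The example row above has a question TextBlob score of 0.0 and an answer
# score of 1.0, which matches is_ques_subjective=False / is_ans_subjective=True.
print(derive_subjectivity_label(0.0), derive_subjectivity_label(1.0))  # prints: False True
```

This is consistent with the example row quoted earlier, where question_subj_level = 2 but is_ques_subjective = False: the label tracks the 0.0 TextBlob score, not the rating.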