BBQ
BBQ copied to clipboard
Reproducibility Issues (Bias Score and Accuracy)
I have been working on reproducing the results using the code from your repository and implementing it in Python. I successfully converted the provided R code into Python, and the outputs from both versions match. However, the results from your repository code do not align with the figures presented in your research paper.
Code used: [BBQ_calculate_bias_score.R] (https://github.com/nyu-mll/BBQ/blob/main/analysis_scripts/BBQ_calculate_bias_score.R) Research paper link: [QA Bias Benchmark] (https://github.com/nyu-mll/BBQ/blob/main/QA_bias_benchmark.pdf)
Here is the output I obtained using the R code from your repository:
Comparing Dberta V3 Base (For Disambiguous): Comparison With Paper : - Age : Match - Disability : Match - Gender Identity : 13.9 (R-Code) instead of 15 (paper) - Gender Identity Names : 12.7 instead of 14 - Nationality : Match - Physical Appearance : 41.8 instead of 41 - Race and Ethnicity : 4.7 instead of 4.6 - Race and Ethnicity Names : Match - Religion : Match - Sexual Orientation : Match - SES : Match
Same patter can be seen across other models also.
Any help/clarifications would be appreciated here. Thanks