
Fairness review on deep reference parser algorithm

Open aoifespenge opened this issue 5 years ago • 7 comments

Tasks:

  • [x] Discuss and decide fairness criteria.
  • [ ] Conduct fairness review.
  • [ ] Report to team.

The deep reference parser needs to undergo a fairness review. Before this can happen we need to answer the following questions:

How do we define fairness in this case?

  • Do we care about treating people the same across different groups?
  • What are the groups that we think are important if any?
  • Or do we care about treating every single individual the same?

How does our definition translate into testing the algorithm for fairness?

  • Which metric(s) are most appropriate for our case? (A rough sketch of one option is below.)
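As a starting point, if we go with a group-based definition, the review could compare a headline metric across groups and flag large gaps. A minimal sketch, assuming we have (group, gold, predicted) triples; the function names and example data are all hypothetical:

```python
from collections import defaultdict

def per_group_accuracy(records):
    """records: iterable of (group, gold, predicted) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, gold, predicted in records:
        total[group] += 1
        correct[group] += int(gold == predicted)
    return {g: correct[g] / total[g] for g in total}

def max_accuracy_gap(records):
    """Gap between the best- and worst-served group."""
    scores = per_group_accuracy(records)
    return max(scores.values()) - min(scores.values())

# Hypothetical example: exact-match parsing accuracy split by language.
records = [
    ("en", "Smith 2019", "Smith 2019"),
    ("fr", "Dupont 2018", "Dupont 2017"),
]
print(max_accuracy_gap(records))  # 1.0 here; we'd flag gaps above some agreed tolerance
```

What counts as an acceptable gap would itself be part of the fairness criteria we decide on.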

aoifespenge avatar Jan 20 '20 15:01 aoifespenge

I think the analysis should be the same as what we have done so far, just replicated using the new model. We should aim to do it end to end, but given the limited data we might find that we need to annotate more before we can. In that case we might decide to postpone the end-to-end version and simply replicate the analysis on the existing data.

nsorros avatar Jan 21 '20 17:01 nsorros

Why would we need to label more data @nsorros? Btw I said to @aoifespenge today that I envisage this being another Airflow task that runs at the end of a DAG, just like the end-to-end evaluation for the more usual metrics. Is that what you had in mind?
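Roughly something like this, as a sketch (the DAG id, task ids, and callables are all made up, and I'm assuming plain PythonOperators with Airflow 2-style imports):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_end_to_end_evaluation():
    ...  # the existing evaluation logic

def run_fairness_review():
    ...  # e.g. compute per-group metrics over the gold data

# DAG id and task ids below are hypothetical.
with DAG(
    dag_id="reach_evaluation",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    evaluate = PythonOperator(
        task_id="end_to_end_evaluation",
        python_callable=run_end_to_end_evaluation,
    )
    fairness = PythonOperator(
        task_id="fairness_review",
        python_callable=run_fairness_review,
    )
    evaluate >> fairness  # fairness review runs as the last task in the DAG
```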

ivyleavedtoadflax avatar Jan 21 '20 19:01 ivyleavedtoadflax

> Why would we need to label more data @nsorros? Btw I said to @aoifespenge today that I envisage this being another Airflow task that runs at the end of a DAG, just like the end-to-end evaluation for the more usual metrics. Is that what you had in mind?

Not a bad idea; we can definitely have it in Airflow as well, even though the analysis would only change when a new model is deployed.

We would need to label more data because how else would you quantify whether the algorithm is biased towards, say, sociology or non-English publications if none of your data contains either?
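To make that concrete, the first step of the review would just be checking that each group we care about actually shows up in the labelled data. A rough sketch; the group fields and values are hypothetical:

```python
from collections import Counter

# Hypothetical group fields and values; the real ones depend on
# the fairness criteria we decide on.
GROUPS_OF_INTEREST = {
    "field": {"sociology", "biomedicine"},
    "language": {"en", "non-en"},
}

def coverage_report(labelled_records):
    """labelled_records: dicts carrying the group fields above."""
    for field, wanted in GROUPS_OF_INTEREST.items():
        counts = Counter(r.get(field) for r in labelled_records)
        missing = wanted - set(counts)
        if missing:
            print(f"{field}: no labelled examples for {sorted(missing)}")
        else:
            print(f"{field}: {dict(counts)}")
```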

nsorros avatar Jan 22 '20 10:01 nsorros

Sorry I guess my question was more, which data should we label more of?

I think it would make sense to run the ethical assessment after each update to reach, not just the model, because an improvement to the scraper, or adding a new provider, could equally have an impact on the fairness of the whole pipeline.

ivyleavedtoadflax avatar Jan 22 '20 11:01 ivyleavedtoadflax

Good point, and why not. The data that might need more annotating is the gold data: more titles that are matched to PubMed IDs and have the necessary metadata.
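For illustration, a gold record would need to look something like this (field names and values are illustrative, not the actual schema):

```python
# Illustrative only -- not the actual schema.
gold_record = {
    "title": "Effects of housing quality on childhood asthma",  # made-up title
    "pmid": "00000000",  # placeholder PubMed ID
    "metadata": {
        "language": "en",        # enables a language-based fairness split
        "field": "biomedicine",  # enables a subject-based split
        "year": 2018,
    },
}
```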

nsorros avatar Jan 22 '20 12:01 nsorros

Agree. More data is good data.

ivyleavedtoadflax avatar Jan 22 '20 15:01 ivyleavedtoadflax

This is currently blocked by https://github.com/wellcometrust/reach/issues/48

ivyleavedtoadflax avatar Feb 03 '20 15:02 ivyleavedtoadflax