Improve identification of references sections
I know of three ways in which a references section can be badly identified:
- Text in between references is included
- Text after a references section is included
- The references section is not entirely scraped
Examples for each of these:
1. Text in between references is included - Classify reference lines as outside
I found an example where the document header was identical to a reference title in our match set, so the header was classified as a reference.
This happened because the references section ran over multiple pages, so the header was picked up in the section text several times.
The Reach tool identified that the reference "PRIORITIES FOR TUBERCULOSIS RESEARCH" was cited 8 times in: https://apps.who.int/iris/bitstream/handle/10665/85888/9789241505970_eng.pdf?sequence=1&isAllowed=y
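One possible quick fix for this case is to drop lines that recur across pages before matching. The sketch below is only an illustration: it assumes the scraped section is available as a list of pages, each a list of lines, which may not be how the scraper stores it, and `strip_repeated_lines` and `min_pages` are hypothetical names.

```python
from collections import Counter


def strip_repeated_lines(pages, min_pages=3):
    """Drop lines that recur on at least `min_pages` distinct pages.

    `pages` is assumed to be a list of pages, each a list of text lines.
    """
    counts = Counter()
    for page in pages:
        # Count each normalised line at most once per page.
        counts.update({line.strip().lower() for line in page if line.strip()})
    repeated = {line for line, n in counts.items() if n >= min_pages}
    return [
        [line for line in page if line.strip().lower() not in repeated]
        for page in pages
    ]
```

The trade-off is that a genuine reference whose text is identical to the running header would also be removed, so the threshold needs care.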
2. Text after a references section is included - Improve boundaries
The Reach tool identified that the reference "Treatment of severe malaria" was cited 55 times in: https://apps.who.int/iris/bitstream/handle/10665/78945/9789241564533_eng.pdf?sequence=1&isAllowed=y. This is because it was the row name of a table that was repeated several times after the references section, and that text was included in the section scrape.
3. The references section is not entirely scraped - Improve boundaries
In https://apps.who.int/iris/bitstream/handle/10665/178166/9789290232810.pdf?sequence=5&isAllowed=y we find two papers in the references section entitled "attention deficit hyperactivity disorder". These were not picked up in the fuzzy match search but were found in the exact text search, which implies that the references section was not scraped in full.
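The same comparison could be used as a rough diagnostic for incomplete scrapes: flag any title that appears in the full document text but not in the scraped section. This is a sketch under the assumption that both texts are available as plain strings; `missed_titles` is a hypothetical helper, not part of Reach.

```python
def missed_titles(titles, full_text, section_text):
    """Return titles found in the full text but absent from the scraped section."""

    def norm(text):
        # Lowercase and collapse whitespace so line breaks do not hide a match.
        return " ".join(text.lower().split())

    full, section = norm(full_text), norm(section_text)
    return [t for t in titles if norm(t) in full and norm(t) not in section]
```

A non-empty result for a document suggests that the section boundaries cut off part of the references.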
I assume that this will be addressed by Matt's new model. Are there any quick fixes in the meantime? The only thing I can think of is a line-level binary classifier that says whether a line is a reference, which we could construct relatively quickly. At the same time, it is probably still worth waiting for Matt's new model. Thoughts?
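To make the line-level idea concrete, here is a minimal rule-based stand-in, not a trained classifier. The cues (year, author initials, volume and page pattern, "et al"/DOI/URL) and the threshold are assumptions chosen for illustration; a real model would learn them from labelled lines.

```python
import re

# Hypothetical hand-written cues; a trained model would learn these instead.
YEAR = re.compile(r"\b(19|20)\d{2}\b")
AUTHOR = re.compile(r"\b[A-Z][a-z]+,? (?:[A-Z]\.? ?){1,3}(?=[,.;]|\s|$)")
VOLUME_PAGES = re.compile(r"\d+\s*(\(\d+\))?\s*:\s*\d+\s*[-–]\s*\d+")
MISC = re.compile(r"et al\.?|doi:|https?://", re.IGNORECASE)


def reference_score(line: str) -> int:
    """Count how many reference-like cues appear in a line (0 to 4)."""
    return sum(bool(p.search(line)) for p in (YEAR, AUTHOR, VOLUME_PAGES, MISC))


def is_reference_line(line: str, threshold: int = 2) -> bool:
    """Binary decision: treat the line as a reference if enough cues fire."""
    return reference_score(line) >= threshold
```

With these rules, a bare title such as a page header or a table row name scores zero and would be classified as outside the references, which covers cases 1 and 2 above.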
@ivyleavedtoadflax can this be closed? If yes, you can just close it.