Improve identification of references sections

Open lizgzil opened this issue 5 years ago • 2 comments

I know of three ways in which references sections are badly identified:

  1. Text in between references is included
  2. Text after a references section is included
  3. The references section is not entirely scraped

Examples for each of these:

1. Text in between references is included - Classify reference lines as outside

I found an example where the document header was also a reference title in our match set, and was classified as a reference.

This happened because the references section ran over multiple pages, so the header was picked up in the section's text several times.

The Reach tool identified that the reference "PRIORITIES FOR TUBERCULOSIS RESEARCH" was cited 8 times in: https://apps.who.int/iris/bitstream/handle/10665/85888/9789241505970_eng.pdf?sequence=1&isAllowed=y
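One possible mitigation for this case, sketched below, is to drop lines that recur on several pages before reference matching. This is only an illustration, not how Reach works: it assumes the scraper can hand over the section text page by page, and `drop_running_headers` and `min_pages` are hypothetical names.

```python
from collections import Counter


def drop_running_headers(pages, min_pages=3):
    """Remove lines that recur on at least `min_pages` distinct pages.

    `pages` is a list of strings, one per page (an assumption about
    what the scraper can provide).
    """
    # Count each normalised line once per page it appears on.
    page_counts = Counter()
    for page in pages:
        seen = {ln.strip().lower() for ln in page.splitlines() if ln.strip()}
        page_counts.update(seen)

    repeated = {ln for ln, n in page_counts.items() if n >= min_pages}

    # Rebuild each page without the repeated lines.
    return [
        "\n".join(
            ln for ln in page.splitlines() if ln.strip().lower() not in repeated
        )
        for page in pages
    ]
```

A genuine reference could be removed if it is cited on many pages, so the threshold would need tuning against real documents.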

2. Text after a references section is included - Improve boundaries

The Reach tool identified that the reference "Treatment of severe malaria" was cited 55 times in: https://apps.who.int/iris/bitstream/handle/10665/78945/9789241564533_eng.pdf?sequence=1&isAllowed=y. This is because it was the row name of a table that was repeated several times after the references section text, and the table was included in the section scrape.

3. The references section is not entirely scraped - Improve boundaries

In https://apps.who.int/iris/bitstream/handle/10665/178166/9789290232810.pdf?sequence=5&isAllowed=y we find two papers in the references section entitled "attention deficit hyperactivity disorder". These were not picked up in the fuzzy match search, but were found in the exact text search. This implies that the references section was not scraped properly.
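The check described here could be automated roughly as follows: flag titles that appear verbatim in the full document text but not in the scraped references section. This is a sketch under assumptions, since it presumes both texts are available as plain strings, and `titles_missed_by_section_scrape` is a hypothetical helper, not part of Reach.

```python
def _normalise(text):
    # Lower-case and collapse whitespace so line breaks do not block a match.
    return " ".join(text.lower().split())


def titles_missed_by_section_scrape(titles, full_text, section_text):
    """Return titles found in the full text but absent from the scraped section."""
    full = _normalise(full_text)
    section = _normalise(section_text)
    return [
        title
        for title in titles
        if _normalise(title) in full and _normalise(title) not in section
    ]
```

A non-empty result suggests the section boundaries were drawn too tightly for that document.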

lizgzil avatar Jul 18 '19 14:07 lizgzil

I assume that this will be addressed by Matt's new model. Are there any quick fixes in the meantime? The only thing I can think of is a binary classifier at the line level that says whether a line is a reference. We could construct that relatively quickly, but at the same time it is probably still worth waiting for Matt's new model. Thoughts?
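As a rough idea of what such a line-level classifier might look like, here is a rule-based sketch. It is a stand-in for a trained model, not Matt's model: the patterns and the `threshold` value are assumptions and have not been tuned on Reach data.

```python
import re

# Hypothetical surface features that reference lines often show.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
LEADING_INDEX = re.compile(r"^\s*(?:\[\d+\]|\d+[.)])\s+")
AUTHOR_INITIALS = re.compile(r"\b[A-Z][a-z]+,? [A-Z]{1,3}[,.]")
LOCATOR = re.compile(r"doi:|https?://|\d+\s*[-–]\s*\d+", re.IGNORECASE)

FEATURES = (YEAR, LEADING_INDEX, AUTHOR_INITIALS, LOCATOR)


def looks_like_reference(line, threshold=2):
    """Return True if at least `threshold` features fire on the line."""
    score = sum(1 for pattern in FEATURES if pattern.search(line))
    return score >= threshold
```

Lines such as a repeated header or a table row name would match none of these features, so they would be classified as outside the references.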

nsorros avatar Jul 19 '19 15:07 nsorros

@ivyleavedtoadflax can this be closed? if yes, you can just close

kristinenielsen avatar Feb 26 '20 11:02 kristinenielsen