Are some of the sections not extracted fully?
PR https://github.com/wellcometrust/reach/pull/319 fixes an issue with duplication, but while working on it I noticed a possible problem with the logic in this line of `grab_section` in `pdf_parse.py`:

```python
result = text[text.find(start_title.text):text.find(end_title.text)]
```
What happens here is that the text from all the pages tagged as being part of the references section is cut down to just the references section text. For example, if the text from all the pages identified as containing references was:
```
Some text
Reference
This is the first reference
This is the second reference
This is the third reference
Appendix
This is the appendix
This is the second line of the appendix
```
then with `start_title.text = 'Reference'` and `end_title.text = 'Appendix'`, we'd cut all this text down to the bit we are interested in, i.e. everything between the first occurrences of 'Reference' and 'Appendix':

```
result = Reference This is the first reference This is the second reference This is the third reference
```
However, if the text in the references section itself contained the word 'Appendix' or 'Reference', we might run into problems. E.g. if the text was:
```
Some text
Reference
This is the first reference - it is a citation to the Appendix to another piece of work
This is the second reference
This is the third reference
Appendix
This is the appendix
This is the second line of the appendix
```
then:

```
result = Reference This is the first reference - it is a citation to the
```
In practice, whatever `end_title.text` is may well be quite distinctive text that is very unlikely to come up in the references section anyway, but maybe we should investigate.
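The failure mode above can be reproduced with a minimal sketch of the slicing logic (this is a simplified stand-in for the real `grab_section`, which operates on title elements rather than bare strings):

```python
def grab_section_naive(text, start_heading, end_heading):
    """Cut text down to the span between the FIRST occurrences of the
    start and end headings, as the current line of code does."""
    return text[text.find(start_heading):text.find(end_heading)]


good = ("Some text Reference This is the first reference "
        "This is the second reference Appendix This is the appendix")
bad = ("Some text Reference This is the first reference - it is a "
       "citation to the Appendix to another piece of work "
       "This is the second reference Appendix This is the appendix")

# Works as intended: the slice stops at the real 'Appendix' heading.
print(grab_section_naive(good, "Reference", "Appendix"))

# Truncated: the slice stops at the 'Appendix' inside the citation,
# so the second and third references are silently dropped.
print(grab_section_naive(bad, "Reference", "Appendix"))
```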
True, that may be an issue. I implemented it with `text.find(start_title.text):text.find(end_title.text)`, so it's text-based only. One way you could bypass this is by also checking the font size, but that would mean adding another step in there. Also worth noting that it may only happen on texts that have another section after the references.
I'm not sure this is worth fixing given that this is a bit of code we want to get rid of, but I have to admit I'm curious about the likelihood of this issue actually happening.
I wonder if there is an attribute we could add to the title elements which is the character index where the title starts, then use this rather than `text.find(start_title.text)`.
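As a rough sketch of that idea (the `Title` class and `char_index` attribute here are hypothetical, not part of the current `pdf_parse.py`): if each title element recorded the offset at which it was found, the slice would no longer depend on searching for the heading text.

```python
from dataclasses import dataclass


@dataclass
class Title:
    # Hypothetical title element: the heading text plus the character
    # offset where the heading starts in the full document text.
    text: str
    char_index: int


def grab_section(text, start_title, end_title):
    """Slice using the recorded offsets instead of text.find(), so a
    repeated word inside the section can't truncate the result."""
    return text[start_title.char_index:end_title.char_index]


text = ("Some text Reference This is a citation to the Appendix "
        "to another piece of work Appendix This is the appendix")

# In practice the offsets would be recorded when the title elements are
# created during parsing; here we locate them by hand for illustration
# (rindex picks the real 'Appendix' heading, not the citation).
start = Title("Reference", text.index("Reference"))
end = Title("Appendix", text.rindex("Appendix"))

print(grab_section(text, start, end))
```

The same slice with `text.find(end_title.text)` would have stopped at the 'Appendix' inside the citation.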
But anyway, if this is code we're going to get rid of, then maybe a major overhaul is needed anyway.
I think overall it could probably do with a refactor down the line, but the edge case is narrow enough that I'm alright with leaving it for now and circling back to re-evaluate how we implement this after we get everything else stable.
Yeah, this code was meant to be a starting point; the plan was to swap it for a much more efficient machine learning model.
I think as things develop with the deep reference parser it will make sense to stop extracting reference sections at all, and just pass entire documents to the model, letting it make up its own mind about what is a reference and what isn't. Even if we don't get that far, it would be good to look at a machine learning approach to reference section extraction (like @SamDepardieu's experiment).
Either way, I think there's a strong argument to use Spacy doc classes (rather than custom ones) when we are passing documents from the scraper to the reference extractor. Very happy to elaborate on why - please do include me in the discussion if there is more talk about refactoring.
I think it is about time we do that, and this ties in nicely with Matt's work on the splitter. How best to proceed on this? @jdu @SamDepardieu @ivyleavedtoadflax
I think before we do that I'd like to get Poppler updated in the container to a more recent version, as well as lxml, as I slightly suspect that with a newer version of Poppler we might be able to get text out of the PDFs in a more structured format. Poppler takes a very strict approach to how text is extracted from PDFs, in that it doesn't join contiguous text objects which logically form a line or paragraph based on their coordinate space. We may still need the machine learning, but I just want the baseline text we're operating on to be of a better quality than we're currently getting before we build more on top of it.
I don't know if that will have a direct effect on how easy reference extraction is, but if we can at least be sure that we're operating against text with a baseline of quality, it should make subsequent operations on that text easier.
At present the deep_reference_parser operates on plain text, so pulling out a highly structured document is probably not very important. That said, it is relatively untested on full documents, as we have traditionally only given it reference sections extracted by Reach.
It's not necessarily that it's structured in the sense of an XML doc or similar; it's that we get a cleaner document as a baseline to operate on, where lines of text aren't broken across newlines, etc.
Can we hold fire on those kind of changes until after #48 is fixed and #315 is working well? The current deep reference parser is trained on the data we get from the current reach pipeline. If we change that too much (including fixing breaks across newlines, etc.) it could negatively affect the model, and we will have little idea what caused it unless we have the end to end evaluation working well first. If we did have a problem, the only way to fix it would be re-labelling all the training data which would take weeks - I'm very keen to avoid this!
💯% agree with @ivyleavedtoadflax
Either put this into Notion or label it "won't fix", as it currently is not an issue but potentially could happen in the future.