spacypdfreader
Loss of token/document tensor at least with PDFMiner
Hello,
Thank you for this useful library!
The issue
I ran into a problem with the following code:
import spacy
from spacypdfreader import pdf_reader
nlp = spacy.load("fr_core_news_sm")
doc = pdf_reader('9.PADD_SCOT RM.pdf', nlp)
doc.tensor
I get an empty tensor.
Whereas:
import spacy
from pdfminer import high_level
nlp = spacy.load("fr_dep_news_trf")
path = '9.PADD_SCOT RM.pdf'  # assumed: the same PDF as in the first snippet
doc = nlp(high_level.extract_text(path))
doc.tensor
Returns the right tensor.
Reason
The issue seems to come from the fact that pdf_reader processes each page as a separate document and then merges the results with Doc.from_docs. It turns out that Doc.from_docs does not preserve Doc.tensor (although I could not find this stated anywhere).
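The behaviour described above can be worked around along the following lines. This is only a minimal sketch under stated assumptions: it uses a blank pipeline and a made-up per-token tensor (one row per token) instead of a trained French model, and the helpers `fake_page_doc` and `merge_with_tensor` are hypothetical names, not part of spaCy or spacypdfreader. Whether `Doc.from_docs` carries the tensor over may depend on the spaCy version, so the helper checks the merged doc first and only rebuilds the tensor when needed.

```python
import numpy
import spacy
from spacy.tokens import Doc

# Blank pipeline: needs no model download, but also produces no tensor.
nlp = spacy.blank("en")


def fake_page_doc(text, width=4):
    """Hypothetical helper: build a doc for one page with a stand-in tensor."""
    doc = nlp(text)
    # Stand-in for the per-token rows a trained pipeline would store.
    doc.tensor = numpy.ones((len(doc), width), dtype="float32")
    return doc


def merge_with_tensor(docs):
    """Hypothetical helper: merge page docs and make sure the tensor survives."""
    merged = Doc.from_docs(docs)
    # Rebuild the tensor only if the merge did not keep one row per token.
    if merged.tensor.shape[0] != len(merged):
        merged.tensor = numpy.vstack([d.tensor for d in docs if len(d)])
    return merged


pages = [fake_page_doc("First page text."), fake_page_doc("Second page here.")]
merged = merge_with_tensor(pages)
print(merged.tensor.shape)
```

Because the merged doc keeps the tokens of each page in order, stacking the page tensors row by row should line up with the merged tokens, assuming every page tensor has the same width.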
Hi omarbenhamid, thank you for creating this issue and looking into the problem. I have never encountered this use case, but your explanation makes sense.
The reason each page is processed as a document is so that spacypdfreader can create the page attributes:
- token._.page_number
- doc._.page_range
- doc._.first_page
- doc._.last_page
- doc._.pdf_file_name
- doc._.page(int)
In your use case - do you still require the page number attributes? I think there are a few options:
- Update spacypdfreader so that it re-runs at least some of the NLP pipeline after using Doc.from_docs, so that the doc object has a tensor, but without overwriting the page number attributes (I am not sure yet how to actually do this, but I imagine it can be done).
- Add a parameter to spacypdfreader.pdf_reader that skips the page number attributes and instead runs the NLP pipeline on the entire text at once. This would give a result similar to your example above.
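As a rough illustration of how the second option might still keep some page information, one could record the character offset at which each page starts in the concatenated text and later map a token's character offset (spaCy exposes this as `token.idx`) back to a page. This is a sketch only: `join_pages` and `page_of` are hypothetical helpers, not part of spacypdfreader, and it assumes pages are joined with a single known separator.

```python
from bisect import bisect_right
from typing import List, Tuple


def join_pages(pages: List[str], sep: str = "\n") -> Tuple[str, List[int]]:
    """Concatenate page texts and record where each page starts."""
    starts = []
    offset = 0
    for text in pages:
        starts.append(offset)
        # The next page begins after this page's text plus the separator.
        offset += len(text) + len(sep)
    return sep.join(pages), starts


def page_of(char_offset: int, starts: List[int]) -> int:
    """Return the 1-based page number that contains a character offset."""
    # Count how many pages start at or before the offset.
    return bisect_right(starts, char_offset)


full_text, starts = join_pages(["Page un.", "Page deux.", "Page trois."])
print(page_of(full_text.index("deux"), starts))
```

With this approach the whole text could be passed to the pipeline in one call, and a page number could be looked up afterwards from each token's character offset. A separator character is attributed to the page that precedes it.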
Please let me know if you have any other ideas or suggestions.
Hello SamEdwardes, I opened a discussion with the people at Explosion about the behaviour of Doc.from_docs; they are considering whether to fix it in spaCy directly.
The discussion is here: https://github.com/explosion/spaCy/discussions/10597
Let's wait and see if they come up with a solution.
For now I have worked around the issue on my side by using PDFMiner directly, but that means I lose the page information.