trafilatura
Extract content from formats other than HTML: PDF, EPUB?
So far Trafilatura focuses on HTML documents, should it be able to extract information from PDF files?
I think it's a good idea to create a function for extracting PDF content. Several of the sites I work with return PDFs.
Do you have an idea how to get started? For example, how to identify the columns that separate the text, or how to extract content from a PDF in the first place?
pdfplumber seems like an interesting starting point to me. Sometimes it is also necessary to convert images to text.
There are already libraries doing the job, so I would start by adding one of them, e.g. pdfplumber, yes. In any case I think we would need a roadmap for the integration to go smoothly, since the lines to change are scattered around the code.
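To make the pdfplumber idea concrete, here is a minimal sketch of what a first helper could look like. The names `join_pages` and `extract_pdf_text` are hypothetical and not part of Trafilatura; the only pdfplumber calls used are `pdfplumber.open()`, `pdf.pages` and `page.extract_text()`. It assumes the PDF has a text layer: scanned pages would need a separate OCR step, which this does not attempt, and it does nothing about columns or hyphenation.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages that yielded nothing."""
    cleaned = (text.strip() for text in page_texts if text)
    return "\n\n".join(chunk for chunk in cleaned if chunk)


def extract_pdf_text(path: str) -> str:
    """Hypothetical helper: return the plain text of a PDF via pdfplumber."""
    import pdfplumber  # third-party; imported lazily so it stays optional

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None or "" for pages without a text layer
        return join_pages(page.extract_text() for page in pdf.pages)
```

Keeping the import inside the function would let the PDF dependency stay optional for users who only process HTML.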
Sounds good to me. It would be good to run some tests to work out what we need on the roadmap.
OK, we'll see about that, let's wait for further feedback first.
Other ideas concerning the EPUB format this time:
- Support for many formats, including PDF & EPUB: https://github.com/pymupdf/PyMuPDF
- EPUB metadata: https://github.com/paulocheque/epub-meta
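For a sense of what EPUB metadata extraction involves, here is a rough sketch using only the standard library instead of the packages linked above. An EPUB is a ZIP archive whose `META-INF/container.xml` points to a package (OPF) document carrying Dublin Core fields. The function name `read_epub_metadata` and the choice of fields are assumptions for illustration only.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(source) -> dict:
    """Return title, creators and language from an EPUB (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        # META-INF/container.xml points to the package (OPF) document
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None:
            raise ValueError("no rootfile entry in container.xml")
        package = ET.fromstring(archive.read(rootfile.attrib["full-path"]))

    def values(tag: str) -> list:
        # Collect the non-empty text of every Dublin Core element named `tag`
        elements = package.iter(f"{{{NS['dc']}}}{tag}")
        return [el.text.strip() for el in elements if el.text and el.text.strip()]

    titles = values("title")
    languages = values("language")
    return {
        "title": titles[0] if titles else None,
        "creators": values("creator"),
        "language": languages[0] if languages else None,
    }
```

This only covers metadata; the actual chapters are XHTML files inside the archive, which is arguably the part closest to what Trafilatura already handles.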
I was using some PDF extractor libraries (pdfminer), but they could not deal with layout and hyphenation. I think allowing interaction with GROBID might be a better solution for most use cases. It turns a PDF into XML-TEI using statistical models. I'm currently using trafilatura's xmltotext to convert the XML-TEI body to plain text as well.
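To illustrate the last step, here is a minimal standard-library sketch of turning the body of a TEI document into plain text. It is not Trafilatura's xmltotext, only a stand-in showing the idea; the function name `tei_body_to_text` is hypothetical, and it assumes the document uses the standard TEI namespace with headings in `<head>` and paragraphs in `<p>`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI = {"tei": TEI_NS}
BLOCK_TAGS = {f"{{{TEI_NS}}}head", f"{{{TEI_NS}}}p"}


def tei_body_to_text(tei_xml) -> str:
    """Return headings and paragraphs of the TEI <body> as plain text."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI)
    if body is None:
        return ""
    blocks = []
    for element in body.iter():
        if element.tag in BLOCK_TAGS:
            # itertext() also collects text nested in inline tags such as <ref>;
            # split/join collapses line breaks and repeated spaces
            text = " ".join("".join(element.itertext()).split())
            if text:
                blocks.append(text)
    return "\n\n".join(blocks)
```

Everything in the TEI header (title, authors and so on) is skipped here, so metadata would have to be read separately.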
But either way, I think handling PDFs is very important as well!
re grobid: https://grobid.readthedocs.io/en/latest/Introduction/
For PDFs: GROBID is afaik the best at metadata (author/title) extraction (or the Semantic Scholar and Mendeley APIs are best if they already know of a PDF); I haven't tested it on content. But even GROBID is mediocre at metadata extraction. See also https://csxstatic.ist.psu.edu/downloads/software
For other file types, there exist packages with support for lots of file formats:
- https://www.gnu.org/software/libextractor/
- https://textract.readthedocs.io/en/stable/
- https://tika.apache.org/