Extract content from formats other than HTML: PDF, EPUB?

Open adbar opened this issue 2 years ago • 8 comments

So far Trafilatura focuses on HTML documents. Should it also be able to extract information from PDF files?

Jul 28 '21 14:07 adbar

I think it's a good idea to create a function for extracting PDF content; I have several sites that return PDFs to me.

Do you have an idea of how to get started? For example, how to identify the column that separates the text? Or how to extract content from a PDF?

pdfplumber seems like an interesting place to start; sometimes it is necessary to convert images to text as well.

Jul 29 '21 08:07 felipehertzer

There are already libraries doing the job, so I would start by adding one of them, e.g. pdfplumber. In any case, I think we would need a roadmap for the integration to go smoothly, since the lines to change are scattered around the code.
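
As a rough illustration of what one entry point for such an integration could look like, here is a minimal sketch that sniffs the input format from its leading bytes and only hands PDFs to pdfplumber. The function names `sniff_format` and `pdf_to_text` are hypothetical and not part of Trafilatura; the pdfplumber calls (`pdfplumber.open`, `page.extract_text`) are the only third-party API assumed, and the library is imported lazily so that HTML-only use needs no extra dependency.

```python
import io
import zipfile


def sniff_format(data: bytes) -> str:
    """Guess the document format from its first bytes (hypothetical helper)."""
    # PDF files start with the "%PDF-" marker, possibly after some whitespace.
    if data[:1024].lstrip().startswith(b"%PDF-"):
        return "pdf"
    # EPUB is a ZIP archive containing a "mimetype" entry with a fixed value.
    if data[:4] == b"PK\x03\x04":
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                mimetype = archive.read("mimetype").strip()
        except (zipfile.BadZipFile, KeyError):
            return "unknown"
        return "epub" if mimetype == b"application/epub+zip" else "unknown"
    # Everything else is left to the existing HTML pipeline.
    return "html"


def pdf_to_text(data: bytes) -> str:
    """Extract plain text page by page with pdfplumber (third-party)."""
    import pdfplumber  # lazy import: only needed when a PDF is actually seen

    with pdfplumber.open(io.BytesIO(data)) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

Keeping the format check separate from the extractors would let the HTML code path stay untouched while other formats are added one at a time.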

Jul 29 '21 10:07 adbar

Sounds good to me. It would be good to run some tests to predict what we need to have on the roadmap.

Jul 29 '21 11:07 felipehertzer

OK, we'll see about that, let's wait for further feedback first.

Jul 30 '21 10:07 adbar

Other ideas concerning the EPUB format this time:

  • Support for many formats, including PDF & EPUB: https://github.com/pymupdf/PyMuPDF
  • EPUB metadata: https://github.com/paulocheque/epub-meta

Sep 24 '21 14:09 adbar

I have been using PDF extraction libraries (pdfminer), but they could not deal with layout and hyphenation. I think allowing interaction with GROBID might be a better solution for most use cases. It turns a PDF into XML-TEI using statistical models. I'm currently using trafilatura xmltotext to convert the XML-TEI body to TXT as well.

But either way, I think handling PDFs is very important as well!
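
To make the second half of that workflow concrete, here is a minimal stand-in for the TEI-to-text step, written with the standard library only. It is not Trafilatura's converter and it does not call GROBID; it assumes the XML-TEI output is already available as a string, and `tei_body_to_text` is a hypothetical name. It keeps only headings and paragraphs from the TEI body and normalizes whitespace.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}
# Fully qualified tag names of the elements treated as text blocks.
BLOCK_TAGS = {f"{{{TEI_URI}}}head", f"{{{TEI_URI}}}p"}


def tei_body_to_text(tei_xml: str) -> str:
    """Flatten the <body> of an XML-TEI document into plain text."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return ""
    blocks = []
    for element in body.iter():
        if element.tag in BLOCK_TAGS:
            # itertext() also yields the text of inline children such as <ref>.
            text = " ".join("".join(element.itertext()).split())
            if text:
                blocks.append(text)
    # One blank line between headings and paragraphs.
    return "\n\n".join(blocks)
```

The header (title, authors) is deliberately ignored here, since metadata would be handled separately from the body text.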

Oct 06 '21 09:10 oguzserbetci

re grobid: https://grobid.readthedocs.io/en/latest/Introduction/

Nov 08 '21 15:11 amirouche

For PDFs: GROBID is, afaik, the best at metadata (author/title) extraction (or the Semantic Scholar and Mendeley APIs are best if they already know of a PDF), but I haven't tested it on content. Even GROBID is mediocre at metadata extraction, though. See also https://csxstatic.ist.psu.edu/downloads/software

For other file types, there are packages with support for lots of file formats:

  • https://www.gnu.org/software/libextractor/
  • https://textract.readthedocs.io/en/stable/
  • https://tika.apache.org/

Jan 24 '22 17:01 acertain