error reporting in pdf_text

Open sckott opened this issue 10 years ago • 14 comments

Errors are a little hard to parse; they are all combined into one string. Though maybe this is good enough?

an example

download.file("https://github.com/sckott/scott/raw/gh-pages/pdfs/Chamberlain%26Rudgers2011EvolEcol.pdf", 
              "paper.pdf")
pdftools::pdf_text('paper.pdf')

#> poppler/error: Invalid shared object hint table offsetpoppler/error: Failed to get object num from 
#> hint tables for page 1poppler/error: Failed parsing page 1 using hint tablespoppler/error: Failed to 
#> get object num from hint tables for page 1poppler/error: Failed parsing page 1 using 
#> hint tablespoppler/error: Failed to get object num from hint tables for page 1poppler/error: 
#> Failed parsing page 1 using hint tables
#> ... cutoff

sckott avatar Feb 25 '16 18:02 sckott

and wonder if this is something that can be fixed in pdftools, or if the pdf itself is malformed

sckott avatar Feb 25 '16 18:02 sckott

This is very difficult. The error messages are printed to stderr by the C library, so we never actually get them in R. When compiling libpoppler you can configure some settings on how it deals with errors, but that is usually beyond our control from the R interface.

jeroen avatar Feb 25 '16 18:02 jeroen

Not sure why this file is giving these errors. I'll ask on the poppler mailing list.

jeroen avatar Feb 25 '16 19:02 jeroen

If you find more problematic PDF files can you add them to this issue? That is very helpful for testing / debugging.

jeroen avatar Feb 25 '16 19:02 jeroen

Yep, will do

sckott avatar Feb 25 '16 19:02 sckott

OK I found a way to set a custom error callback: https://github.com/jeroenooms/pdftools/commit/0060f146675820a6edf9b107b8f5ef0ed1220840. So the parsing errors now show up in R as messages, which is much nicer.

Still don't know why your pdf is giving so many errors though.

jeroen avatar Feb 25 '16 20:02 jeroen
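
Once the parsing errors arrive as R messages, they can be collected one per element instead of being read off the console. The sketch below uses only base R; `collect_messages()` and `fake_parse()` are hypothetical names made up for illustration and are not part of pdftools. It assumes the errors are surfaced through `message()`, as described in the comment above.

```r
# Hypothetical helper (not part of pdftools): evaluate an expression and
# return its value together with any messages it emitted, one per element.
collect_messages <- function(expr) {
  msgs <- character()
  value <- withCallingHandlers(
    expr,
    message = function(m) {
      # Store the message text without its trailing newline
      msgs <<- c(msgs, trimws(conditionMessage(m)))
      # Keep the message from also being printed to the console
      invokeRestart("muffleMessage")
    }
  )
  list(value = value, messages = msgs)
}

# Stand-in for a parser that reports problems via message() but still
# returns the extracted text.
fake_parse <- function() {
  message("error: Invalid shared object hint table offset")
  message("error: Failed parsing page 1 using hint tables")
  "page text"
}

res <- collect_messages(fake_parse())
res$messages  # character vector, one parsing error per element
res$value     # the text is still returned

# With pdftools installed and paper.pdf downloaded as above, the same helper
# could wrap the real call (assuming the errors arrive as messages):
# res <- collect_messages(pdftools::pdf_text("paper.pdf"))
```

Muffling the message inside the handler is a design choice: it keeps the console quiet while still preserving every error for later inspection.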

Awesome. Have been reviewing a paper, will get more egs soon...

sckott avatar Feb 25 '16 21:02 sckott

another eg

download.file("https://github.com/sckott/scott/raw/gh-pages/pdfs/ChamberlainEtal2010Oecologia_journalcopy.pdf", 
              "paper.pdf")
pdftools::pdf_text('paper.pdf')
#> error: Invalid shared object hint table offset
#> error: Failed to get object num from hint tables for page 1
#> error: Failed parsing page 1 using hint tables
#> error: Failed to get object num from hint tables for page 1
#>  ...  cutoff

sckott avatar Feb 25 '16 23:02 sckott

Are you getting the same errors with the pdftotext command line utility (which is included with the same poppler package from brew)?

pdftotext ChamberlainEtal2010Oecologia_journalcopy.pdf

jeroen avatar Feb 25 '16 23:02 jeroen
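
For comparing against the command line tool from within R, something like the following sketch might help. `run_pdftotext()` is a hypothetical helper, not part of pdftools; it assumes the poppler utilities are on the `PATH` and returns `NULL` when they are not.

```r
# Hypothetical helper: run the pdftotext CLI on a file and return the
# extracted text and the diagnostics (stderr) separately.
run_pdftotext <- function(path) {
  exe <- unname(Sys.which("pdftotext"))
  if (!nzchar(exe)) return(NULL)  # poppler utilities not found on PATH

  err_file <- tempfile(fileext = ".log")
  on.exit(unlink(err_file), add = TRUE)

  # "-" asks pdftotext to write the extracted text to stdout;
  # stderr is redirected to a temporary file so it can be read line by line.
  text <- suppressWarnings(
    system2(exe, c(shQuote(path), "-"), stdout = TRUE, stderr = err_file)
  )
  diagnostics <- if (file.exists(err_file)) {
    readLines(err_file, warn = FALSE)
  } else {
    character()
  }
  list(text = text, diagnostics = diagnostics)
}

# Usage, assuming the file from the earlier example is in the working directory:
# out <- run_pdftotext("ChamberlainEtal2010Oecologia_journalcopy.pdf")
# out$diagnostics
```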

Yes, same errors,

Syntax Warning: Invalid shared object hint table offset
Syntax Warning: Failed to get object num from hint tables for page 1
Syntax Warning: Failed parsing page 1 using hint tables

...

sckott avatar Feb 25 '16 23:02 sckott

All of these give errors of various kinds https://github.com/sckott/pdftoolspdfs - let me know if you want me to paste in the errors

sckott avatar Feb 25 '16 23:02 sckott

But do you still get the correct text, even though there were parsing errors on some of the elements? Things like watermarks seem to cause conversion errors, but all the main text should be there?

jeroen avatar Feb 25 '16 23:02 jeroen

Sorry, yes, the text does come back fine

sckott avatar Feb 25 '16 23:02 sckott

Probably the same issue (the result looks good in both cases)

library(pdftools) # (pdftools    * 2.2     2019-03-10 [1] CRAN (R 3.5.3))
download.file("http://www.staedtestatistik.de/fileadmin/vdst/Dortmund2019/503FJT2019_RShiny.pdf", "paper.pdf", mode = "wb")

bitmap <- pdftools::pdf_render_page('paper.pdf')
#> PDF error: Invalid least number of objects reading page offset hints table
str(bitmap, 1)
#>  'bitmap' raw [1:4, 1:842, 1:595] ff ff ff ff ...

txt <- pdf_text("paper.pdf")
#> PDF error: Invalid least number of objects reading page offset hints table
str(txt, 1)
#>  chr [1:14] "  Nutzung von R        R-Shiny-Apps     Zeitreihenapp      Tourenplanung Verteilungsvergleiche Fazit\r\nVDSt Fr"| __truncated__ ...

Created on 2019-06-06 by the reprex package (v0.3.0)

patperu avatar Jun 06 '19 20:06 patperu