TIFF Read Encoded Strip Error
Hi, when I loop over my PDFs and use `ocr_data()`, after a while (about two hours) it produces the following error:
```
TIFFReadEncodedStrip Error
Read error at scanline 0; got 0 bytes, expected 9918
[OK]
```
It is a popup in Windows; it is NOT an R error. After hitting "OK" it doesn't go away and the same popup appears again, so it completely stops my script.
I did see another issue raised about a memory leak when using `ocr_data()`; perhaps this is related?
Thanks, Oneiricer
Oh, that's bad. Can you give an example TIFF file so I can test it?
Hi Jeroen,
Sure, here is the page/TIFF file that it failed on. I suspect that when you run `ocr_data()` on it, it will work; it seems to fail randomly.
Hmm, I cannot reproduce this; it seems fine for me. Can you please show your `sessionInfo()` and a screenshot of what you are seeing?
Hi Jeroen,
Sorry for taking so long to get back to you. I use `ocr_data()` instead of `ocr()`; maybe that produces a different result?
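For clarity, the only difference is which function my loop calls (a minimal sketch; "page.tiff" stands in for the converted page):

```r
library(tesseract)

# ocr() returns a single string of recognised text;
# ocr_data() returns a data frame with one row per word, including a confidence column
text <- ocr("page.tiff")
df   <- ocr_data("page.tiff")
```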
This is what happens when I let it run for about two hours:
Note that when I hit the "OK" button it just creates another error, with the strip number increased by one; it is an endless loop of errors. I actually have to terminate R, so I can't run `sessionInfo()`.
Could you help me by extracting a minimal code example that shows this problem? It is very difficult for me to reproduce this on my machine based only on your screenshots.
Hi Jeroen,
Here's all my code and all the PDFs. Thankfully there are no privacy concerns from my company around sharing these PDFs; they are already publicly available. If you could please extract them into a separate folder and run the code, see if you get the same error that I do. The error only appears about an hour or two into the process; I have a feeling it is either a network issue or a PC issue. Apologies for my bad code; I am still very new to R.
Link to the PDFs (70 MB):
https://drive.google.com/open?id=1GCfs4uHa-Zx9V3SEBz8CaNI5zmJtC5o0
```r
library(tidytext)
library(tesseract)
library(quanteda)
library(dplyr)
library(magrittr)
library(RODBC)
library(stringr)
library(tidyr)

dest <- "H:\\R\\R scrap"

# List of PDF files (full paths)
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
# File names only
myfiles_names <- list.files(path = dest, pattern = "pdf", full.names = FALSE)

# cn <- odbcDriverConnect(connection = "Driver={SQL Server Native Client 11.0};server=vicvit138wtsql;database=R_DB;trusted_connection=yes;")

# Loop through all PDF files in the folder
loop1 <- 0
for (a in myfiles) {
  loop1 <- loop1 + 1                                   # keep track of where the first loop is
  nam <- myfiles_names[loop1] %>% str_sub(., end = -5) # file name without the ".pdf" extension
  pages <- pdftools::pdf_info(myfiles[loop1])$pages    # read the number of pages
  loop2 <- 0
  for (b in pages) {
    start.time <- Sys.time()
    while (loop2 < b) {
      loop2 <- loop2 + 1
      img_file <- pdftools::pdf_convert(a, format = "tiff", dpi = 400, pages = loop2)
      raw_text <- ocr_data(img_file)
      raw_text <- data.frame(raw_text$word, raw_text$confidence)   # remove bbox
      raw_text <- raw_text %>% mutate(Word_Number = row_number())  # add word number
      raw_text <- raw_text %>% mutate(Page = loop2)                # add page number
      raw_text <- raw_text %>% mutate(Matter_Number = nam)         # add name of file to table
      raw_text <- raw_text %>% mutate(System_Time = start.time)    # add system time
      # sqlSave(cn, dat = raw_text, tablename = "RawPDFText", verbose = FALSE, fast = TRUE, append = TRUE, rownames = FALSE)
      remove(img_file)  # removes the R variable (the TIFF file itself stays on disk)
    }
  }
}
```
Hi Jeroen,
I tried re-running the same script with the same PDF files on my beefier home PC and got the same issue. I hope you can reproduce the problem on your end.
Thanks, Oneiricer
Can you please try to reduce the example and narrow it down to only the R call and the PDF file that actually causes the problem?
I don't think it is that simple. I have 30 PDF documents, and each time I run the script the error comes up for a different PDF, on a different page. However, when I OCR the failing page individually, the error does not occur. A smaller subset of PDFs works, but my organisation averages far more than 30 PDF documents, so I need to be able to scale up.
It is almost as if something in memory is jamming up when looping through the files.
After working on this a bit more, I've been a little more successful. I included a garbage collection call, `gc()`, in my loop and it ran significantly longer: from one hour to roughly four hours.
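Roughly like this, inside the per-page loop from my earlier post (a sketch, not the full script):

```r
while (loop2 < b) {
  loop2 <- loop2 + 1
  img_file <- pdftools::pdf_convert(a, format = "tiff", dpi = 400, pages = loop2)
  raw_text <- ocr_data(img_file)
  # ... build and store the data frame as before ...
  remove(img_file)
  gc()  # force garbage collection after every page
}
```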
Interestingly, it still crashed with the following:
[screenshot: https://user-images.githubusercontent.com/45953330/51294597-bdd02800-1a67-11e9-88db-3f6083d1d372.png]
I have released a new version of pdftools today that fixes some memory leaks. Perhaps you can try updating to pdftools 2.1.
Thank you, Jeroen. I just saw that you're the maintainer/author of a few other packages; thanks a lot for your work, I definitely appreciate it.
Will the binaries be available on CRAN soon? I am not familiar with compiling source code.
The binaries are on CRAN now. Can you try `install.packages("pdftools")` and see if the problem has disappeared?
Hi Jeroen, I've updated pdftools but the problem still persists. Again, it runs for about an hour or two before throwing that error. I also note that about 98% of CPU seems to be dedicated to RStudio during this time. It appears I have the same issue as issue #37.
Just wondering, is there a way to use `ocr()` and filter out the words that have confidence below 90?
@Oneiricer I've updated issue #37 with a workaround which might be helpful for your problem.
Also, once you have the confidence levels you can always select rows in the data frame returned by `ocr_data()` to keep only those with confidence > 90.
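Something like this (a minimal sketch; the file name is a placeholder):

```r
library(tesseract)
library(dplyr)

res <- ocr_data("page.tiff")                  # columns: word, confidence, bbox
high_conf <- res %>% filter(confidence > 90)  # keep only the confident words
```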
Hi @JoelSPendery, thanks for the excellent tip. I will look into using a batch file; I haven't had any experience doing this, so it will be a good learning exercise.
I'm aware I can filter and keep records with confidence > 90 when there is a data frame. I'm not sure that is possible with just the `ocr()` command, though. As issue #37 states, the memory leak only occurs with `ocr_data()`.
Thank you once again!
@Oneiricer Yeah, I've been using the batch file to get around the memory leak with the `ocr_data()` command, but you're right: if you don't have the data frame it's difficult. You might check the options that you can pass when creating an engine; I'm not sure there is a confidence-threshold option, though. You can check all the parameters by running `tesseract_params("")`. Maybe one of them is useful.
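For example (a sketch; `tessedit_char_whitelist` just illustrates passing an option when creating the engine, it is not a confidence threshold, and the file name is a placeholder):

```r
library(tesseract)

params <- tesseract_params("")  # an empty filter lists every engine parameter

# Options are passed as a named list when building the engine
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
text <- ocr("page.tiff", engine = engine)
```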
Hmm, I really don't think there is a memory leak. If I run `gc()` and check how many MB are in use, then run `repeat { example(ocr) }` for a few minutes, afterwards `gc()` shows about the same amount of memory in use.
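In other words, the test was roughly (sketch):

```r
library(tesseract)

gc()                     # note the "used" memory before
repeat { example(ocr) }  # run the ocr() example in a loop; interrupt after a few minutes
gc()                     # afterwards, roughly the same amount of memory is in use
```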
The memory leak only occurs with `ocr_data()`.
@stweil How do I destroy this iterator: `tesseract::ResultIterator* ri = api->GetIterator();`? Can I just `delete` the pointer?
Yes, that's one possible solution. See https://github.com/tesseract-ocr/tesseract/blob/master/src/api/hocrrenderer.cpp#L179 for the more modern variant.
@Oneiricer @JoelSPendery Could you test whether the problem is fixed in the dev version?
`remotes::install_github("ropensci/tesseract")`
Whenever I try to install development versions of any packages, I keep getting errors:
```
Error: Could not find tools necessary to compile a package
In addition: Warning messages:
1: In untar2(tarfile, files, list, exdir) : skipping pax global extended headers
2: In untar2(tarfile, files, list, exdir) : skipping pax global extended headers
```
It always asks me to download and run Rtools, even though I already have the latest version.
I seem to get another error when trying on my home PC:
```
Installing package into 'C:/Users/Ken/Documents/R/win-library/3.5' (as 'lib' is unspecified)
ERROR: dependency 'curl' is not available for package 'tesseract'
- removing 'C:/Users/Ken/Documents/R/win-library/3.5/tesseract'
In R CMD INSTALL
installation of package 'C:/Users/Ken/AppData/Local/Temp/RtmpWk7FsF/file1ed059013c2/tesseract_4.0.tar.gz' had non-zero exit status
```
I tried to run `remove.packages("curl")`:
```
Removing package from 'C:/Users/Ken/Documents/R/win-library/3.5' (as 'lib' is unspecified)
Error in remove.packages : there is no package called 'curl'
```
Can you release the binaries onto CRAN?
Hmm, I was able to run `devtools::install_github("ropensci/tesseract")`, but after updating it still says version 4.0.