TIFF Read Encoded Strip Error
Hi, when I loop over my PDFs and use `ocr_data()`, after a while (about two hours) it produces the following error:
```
TIFFReadEncodedStrip Error
Read error at scanline 0; got 0 bytes, expected 9918
[OK]
```
It is a popup in Windows; it is NOT an R error. After hitting "OK" it doesn't go away and the same popup appears again, so it completely stops my script.
I did see another issue raised about a memory leak when using `ocr_data()`; perhaps this is related?
Thanks, Oneiricer
Oh, that's bad. Can you give an example TIFF file so I can test it?
Hi Jeroen,
Sure, here is the page/TIFF file that it failed on. I suspect that when you run `ocr_data()` on it, it will work; it seems to fail randomly.
Hmm, I cannot reproduce this; it seems fine for me. Can you please show your `sessionInfo()` and a screenshot of what you are seeing?
Hi Jeroen,
Sorry for taking so long to get back to you. I use `ocr_data()` instead of `ocr()`; maybe that produces a different result?
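For clarity, the only difference is which function my loop calls (a minimal sketch; "page.tiff" stands in for the converted page):

```r
library(tesseract)

# ocr() returns a single string of recognised text;
# ocr_data() returns a data frame with one row per word, including a confidence column
text <- ocr("page.tiff")
df   <- ocr_data("page.tiff")
```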
This is what happens when I let it run for about two hours:
Note that when I hit the "OK" button it just creates another error, with the strip number increased by one; it is an endless loop of errors. I actually have to terminate R, so I can't run `sessionInfo()`.
Could you help me by extracting a minimal code example that shows this problem? It is very difficult for me to reproduce this on my machine based only on your screenshots.
Hi Jeroen,
Here's all my code and all the PDFs. Thankfully there are no privacy concerns from my company around sharing these PDFs; they are already publicly available. If you could please extract them into a separate folder and run the code, see if you get the same error that I do. The error only appears about an hour or two into the process; I have a feeling it is either a network issue or a PC issue. Apologies for my bad code; I am still very new to R.
Link to the PDFs (70 MB):
https://drive.google.com/open?id=1GCfs4uHa-Zx9V3SEBz8CaNI5zmJtC5o0
```r
library(tidytext)
library(tesseract)
library(quanteda)
library(dplyr)
library(magrittr)
library(RODBC)
library(stringr)
library(tidyr)

dest <- "H:\\R\\R scrap"

# List of PDF files (full paths)
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
# File names only
myfiles_names <- list.files(path = dest, pattern = "pdf", full.names = FALSE)

# cn <- odbcDriverConnect(connection = "Driver={SQL Server Native Client 11.0};server=vicvit138wtsql;database=R_DB;trusted_connection=yes;")

# Loop through all PDF files in the folder
loop1 <- 0
for (a in myfiles) {
  loop1 <- loop1 + 1                                   # keep track of where the first loop is
  nam <- myfiles_names[loop1] %>% str_sub(., end = -5) # file name without the ".pdf" extension
  pages <- pdftools::pdf_info(myfiles[loop1])$pages    # read the number of pages
  loop2 <- 0
  for (b in pages) {
    start.time <- Sys.time()
    while (loop2 < b) {
      loop2 <- loop2 + 1
      img_file <- pdftools::pdf_convert(a, format = "tiff", dpi = 400, pages = loop2)
      raw_text <- ocr_data(img_file)
      raw_text <- data.frame(raw_text$word, raw_text$confidence)   # remove bbox
      raw_text <- raw_text %>% mutate(Word_Number = row_number())  # add word number
      raw_text <- raw_text %>% mutate(Page = loop2)                # add page number
      raw_text <- raw_text %>% mutate(Matter_Number = nam)         # add name of file to table
      raw_text <- raw_text %>% mutate(System_Time = start.time)    # add system time
      # sqlSave(cn, dat = raw_text, tablename = "RawPDFText", verbose = FALSE, fast = TRUE, append = TRUE, rownames = FALSE)
      remove(img_file)  # removes the R variable (the TIFF file itself stays on disk)
    }
  }
}
```
Hi Jeroen,
I tried re-running the same script with the same PDF files on my beefier home PC and got the same issue. I hope you can reproduce the problem on your end.
Thanks, Oneiricer
Can you please try to reduce the example and narrow it down to only the R call and the PDF file that actually causes the problem?
I don't think it is that simple. I have 30 PDF documents, and each time I run the script the error comes up for a different PDF, on a different page. However, when I OCR the failing page individually, the error does not occur. A smaller subset of PDFs works, but my organisation averages far more than 30 PDF documents, so I need to be able to scale up.
It is almost as if something in memory is jamming up when looping through the files.
After working on this a bit more, I've been a little more successful. I included a garbage collection call, `gc()`, in my loop and it ran significantly longer: from one hour to roughly four hours.
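Roughly like this, inside the per-page loop from my earlier post (a sketch, not the full script):

```r
while (loop2 < b) {
  loop2 <- loop2 + 1
  img_file <- pdftools::pdf_convert(a, format = "tiff", dpi = 400, pages = loop2)
  raw_text <- ocr_data(img_file)
  # ... build and store the data frame as before ...
  remove(img_file)
  gc()  # force garbage collection after every page
}
```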
Interestingly, it still crashed with the following:
[screenshot: https://user-images.githubusercontent.com/45953330/51294597-bdd02800-1a67-11e9-88db-3f6083d1d372.png]
I have released a new version of pdftools today that fixes some memory leaks. Perhaps you can try updating to pdftools 2.1.
Thank you, Jeroen. I just saw that you're the maintainer/author of a few other packages; thanks a lot for your work, I definitely appreciate it.
Will the binaries be available on CRAN soon? I am not familiar with compiling source code.
The binaries are on CRAN now. Can you try `install.packages("pdftools")` and see if the problem has disappeared?
Hi Jeroen, I've updated pdftools but the problem still persists. Again, it runs for about an hour or two before throwing that error. I also note that about 98% of CPU seems to be dedicated to RStudio during this time. It appears I have the same issue as issue #37.
Just wondering, is there a way to use `ocr()` and filter out the words that have confidence below 90?
@Oneiricer I've updated issue #37 with a workaround which might be helpful for your problem.
Also, once you have the confidence levels you can always select rows in the data frame returned by `ocr_data()` to keep only those with confidence > 90.
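Something like this (a minimal sketch; the file name is a placeholder):

```r
library(tesseract)
library(dplyr)

res <- ocr_data("page.tiff")                  # columns: word, confidence, bbox
high_conf <- res %>% filter(confidence > 90)  # keep only the confident words
```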
Hi @JoelSPendery, thanks for the excellent tip. I will look into using a batch file; I haven't had any experience doing this, so it will be a good learning exercise.
I'm aware I can filter and keep records with confidence > 90 when there is a data frame. I'm not sure that is possible with just the `ocr()` command, though. As issue #37 states, the memory leak only occurs with `ocr_data()`.
Thank you once again!
@Oneiricer Yeah, I've been using the batch file to get around the memory leak with the `ocr_data()` command, but you're right: if you don't have the data frame it's difficult. You might check the options that you can pass when creating an engine; I'm not sure there is a confidence-threshold option, though. You can check all the parameters by running `tesseract_params("")`. Maybe one of them is useful.
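For example (a sketch; `tessedit_char_whitelist` just illustrates passing an option when creating the engine, it is not a confidence threshold, and the file name is a placeholder):

```r
library(tesseract)

params <- tesseract_params("")  # an empty filter lists every engine parameter

# Options are passed as a named list when building the engine
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
text <- ocr("page.tiff", engine = engine)
```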
Hmm, I really don't think there is a memory leak. If I run `gc()` and check how many MB are in use, then run `repeat { example(ocr) }` for a few minutes, afterwards `gc()` shows about the same amount of memory in use.
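In other words, the test was roughly (sketch):

```r
library(tesseract)

gc()                     # note the "used" memory before
repeat { example(ocr) }  # run the ocr() example in a loop; interrupt after a few minutes
gc()                     # afterwards, roughly the same amount of memory is in use
```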
The memory leak only occurs with `ocr_data()`.
@stweil How do I destroy this iterator: `tesseract::ResultIterator* ri = api->GetIterator();`? Can I just `delete` the pointer?
Yes, that's one possible solution. See https://github.com/tesseract-ocr/tesseract/blob/master/src/api/hocrrenderer.cpp#L179 for the more modern variant.
@Oneiricer @JoelSPendery Could you test whether the problem is fixed in the dev version?
`remotes::install_github("ropensci/tesseract")`
Whenever I try to install development versions of any packages, I keep getting errors:
```
Error: Could not find tools necessary to compile a package
In addition: Warning messages:
1: In untar2(tarfile, files, list, exdir) : skipping pax global extended headers
2: In untar2(tarfile, files, list, exdir) : skipping pax global extended headers
```
It always asks me to download and run Rtools, even though I already have the latest version.
I seem to get another error when trying on my home PC:
```
Installing package into 'C:/Users/Ken/Documents/R/win-library/3.5' (as 'lib' is unspecified)
ERROR: dependency 'curl' is not available for package 'tesseract'
- removing 'C:/Users/Ken/Documents/R/win-library/3.5/tesseract'
In R CMD INSTALL
installation of package 'C:/Users/Ken/AppData/Local/Temp/RtmpWk7FsF/file1ed059013c2/tesseract_4.0.tar.gz' had non-zero exit status
```
I tried to run `remove.packages("curl")`:
```
Removing package from 'C:/Users/Ken/Documents/R/win-library/3.5' (as 'lib' is unspecified)
Error in remove.packages : there is no package called 'curl'
```
Can you release the binaries onto CRAN?
Hmm, I was able to run `devtools::install_github("ropensci/tesseract")`, but after updating it still says version 4.0.