tesseract very slow in R

Open randomgambit opened this issue 5 years ago • 1 comments

Hello there,

Thanks for this amazing binding! I am running into some performance issues and I wonder if you have some hints or ideas.

Basically, the R wrapper works fine but it is very slow. I tried to use furrr and multiprocessing, but I have read on the internet that it is not that easy to run many tesseract processes in parallel. Is that true? Were you able to run tesseract in parallel already?

Thanks~

Feb 13 '21 19:02 randomgambit

Hi Randomgambit, I have run tesseract in parallel on Windows and it seems to perform pretty well. I tested a 47-page PDF both with and without parallel processing. The function using parallel processing appears to be approximately 70% faster. I've included my code below.

Hope this is helpful!

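library(parallel)

parallel_ocr <- function(x) {
  # Split the PDF into one-page files so each page can be processed independently
  pdf_split <- as.list(pdftools::pdf_split(x, "./images/split/"))
  cl <- makeCluster(detectCores())
  on.exit(stopCluster(cl), add = TRUE)  # shut the workers down, even on error
  clusterEvalQ(cl, {library(pdftools); library(tesseract)})

  # Render each page to a PNG, then OCR each PNG, with load balancing
  png_file <- parLapplyLB(cl, pdf_split, pdftools::pdf_convert, dpi = 150)
  text <- parLapplyLB(cl, png_file, tesseract::ocr)

  text  # one list element of recognized text per page
}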
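
If you want to check the speedup on your own machine, here is a rough sketch of how the two approaches could be timed side by side. Note that fake_ocr is a hypothetical stand-in I made up so the snippet runs without tesseract installed; it just sleeps and returns a string. The worker count of 2 and the 8 "pages" are also arbitrary assumptions. You would swap in the real OCR call and your own page list.

```r
library(parallel)

# Hypothetical stand-in for an OCR call on one page: it sleeps briefly
# and returns a string, so the sketch runs without tesseract installed.
fake_ocr <- function(page) {
  Sys.sleep(0.2)
  paste("text of page", page)
}

# Arbitrary example input: eight "pages"
pages <- as.list(1:8)

# Sequential baseline
seq_time <- system.time(seq_text <- lapply(pages, fake_ocr))[["elapsed"]]

# Parallel run with two workers (an assumed count; adjust to your machine)
cl <- makeCluster(2)
par_time <- system.time(par_text <- parLapplyLB(cl, pages, fake_ocr))[["elapsed"]]
stopCluster(cl)

# Both approaches should return the same text, in the same order
stopifnot(identical(seq_text, par_text))

cat(sprintf("sequential: %.2fs, parallel: %.2fs\n", seq_time, par_time))
```

Starting the cluster has a fixed cost, so on very small documents the parallel version may not come out ahead.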

Mar 09 '22 22:03 morgan-dgk