tesseract
The text is not recognized from png
I have to pull data from a PDF hosted at a URL. The PDF is a scanned image (effectively a .png) rather than text, so I used the tesseract package, but a few of the lines were not recognized.
The code:
library(rvest)
library(dplyr)
library(stringr)   # provides str_replace() and fixed()
library(pdftools)
library(tesseract)

url <- "https://www.hindustancopper.com/Page/PriceCircular"
links <- url %>%
  # read the HTML of the page
  read_html() %>%
  # select the first link in the table and extract its href attribute
  html_nodes("#viewTable li:nth-child(1) a") %>%
  html_attr("href") %>%
  # strip the leading "../.." (matched literally, not as a regex)
  str_replace(fixed("../.."), "")
str(links)

# build the full URL of the PDF
base_url <- "https://www.hindustancopper.com"
event_url <- paste0(base_url, links)
event_url

# the link is a scanned copy, not a text PDF, so render it to PNG and run OCR
pdf_convert(event_url, pages = 1, dpi = 850, filenames = "page1.png")
text <- ocr("page1.png")
cat(text)
The actual output reads the list of products and their prices as:
CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056
...etc.
The expected output is:
CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056
...etc.
I have tried changing the value of the dpi argument several times, but that did not help much. Is there another argument to these functions that I am missing? Thanks in advance!
Which OS do you have? What does tesseract::tesseract_info() return?
In the example I tried, the image was a bit skewed. You could improve results by rotating it:
url <- 'https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf'
library(magick)
image_read_pdf(url) %>%
  image_rotate(3) %>%   # rotate by 3 degrees to correct the skew
  image_ocr() %>%
  cat()
The docs have some more ideas on how to preprocess the images to improve the OCR performance:
https://docs.ropensci.org/tesseract/articles/intro.html#preprocessing-with-magick
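As a rough sketch of the kind of preprocessing that vignette describes (not tested against this particular PDF): the helper below converts the scan to grayscale, lets magick estimate and correct the skew instead of hard-coding an angle, and trims the border before OCR. The function name preprocess_scan, the file name page1.png, and the threshold and fuzz values are assumptions to tune, and image_ocr() needs the tesseract package installed.

```r
library(magick)

# Hypothetical helper: clean up a scanned page before OCR.
# The threshold and fuzz values are guesses, not tuned settings.
preprocess_scan <- function(path) {
  image_read(path) %>%
    image_convert(type = "Grayscale") %>%  # drop colour information
    image_deskew(threshold = 40) %>%       # straighten a slightly rotated scan
    image_trim(fuzz = 40)                  # crop the near-uniform border
}

# Assumes page1.png was already written by pdf_convert() as in the question.
if (file.exists("page1.png")) {
  preprocess_scan("page1.png") %>%
    image_ocr() %>%
    cat()
}
```

If the automatic deskew overcorrects, swapping image_deskew() for a fixed image_rotate() as above is the simpler option.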
OS: Windows 10 Pro
tesseract::tesseract_info()
$datapath
[1] "C:\Users\xyz\AppData\Local\tesseract4\tesseract4\tessdata/"
$available
[1] "eng" "osd"
$version
[1] "4.1.0"
$configs
[1] "alto" "ambigs.train" "api_config" "bigram" "box.train" "box.train.stderr"
[7] "digits" "get.images" "hocr" "inter" "kannada" "linebox"
[13] "logfile" "lstm.train" "lstmbox" "lstmdebug" "makebox" "pdf"
[19] "quiet" "rebox" "strokewidth" "tsv" "txt" "unlv"
[25] "wordstrbox"
Thanks, that solved the issue to a large extent, but a stray '|' or '[' is now generated in front of 'CATHODEFULL'. How can I get rid of it?
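One possible way to handle this after the fact, offered as a sketch rather than a confirmed fix: strip any leading '|' or '[' from each OCR line with a regular expression. The helper name clean_ocr_lines and the sample text are made up for illustration; with real data you would pass in the string returned by ocr() or image_ocr().

```r
# Hypothetical helper: split OCR text into lines and remove leading
# '|' or '[' characters (plus surrounding spaces), which OCR tends to
# produce from table borders.
clean_ocr_lines <- function(text) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))
  lines <- sub("^\\s*[|\\[]+\\s*", "", lines, perl = TRUE)
  lines[nzchar(lines)]  # drop empty lines
}

# Made-up sample resembling the output described above
sample_text <- paste(
  "CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567",
  "| CATHODEFULL 434122",
  "[CATHODEFULL 434122",
  sep = "\n"
)
print(clean_ocr_lines(sample_text))
```

This only touches characters at the start of a line, so a '|' or '[' that legitimately appears inside a product name would be left alone.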