tesseract
Failed to extract text~~
Hi all,
I tried this package to extract text from a simple picture, but the results are not as good as expected. The picture is 300 dpi, and this is what I ran:
library(tesseract)
eng <- tesseract("eng")
results2 <- tesseract::ocr_data("path_ziji_600.png", engine = eng)
results2
# A tibble: 18 x 3
   word            confidence bbox
   <chr>                <dbl> <chr>
 1 Hippo                 95.1 205,207,572,344
 2 pathway               96.1 614,207,1122,344
 3 DCHS1/2               88.9 468,515,919,631
 4 FAT                   16.7 1237,522,1427,601
 5 1/2/3/4               16.7 1426,515,1762,631
 6 TAOK                  18.4 2037,514,2314,630
 7 1/2/3                 18.4 2313,514,2581,630
 8 SAV1                  82.5 1061,1156,1316,1237
 9 ||                    83.3 1345,1132,1525,1276
10 STK3/4                92.3 1573,1154,1933,1237
11 --                    55.0 2098,1288,2286,1344
12 CRB1/2                78.2 485,2040,860,2189
13 an                    33.0 1795,1975,2077,2200
14 TEAD2                 91.9 1311,2624,1671,2704
15 Cell                  96.2 703,2995,891,3078
16 proliferation         95.7 922,2995,1507,3101
17 and                   96.5 1537,2995,1702,3078
18 differentiation       96.0 1734,2995,2398,3078
It does not recognize all of the text in this picture, which is a little strange because the picture is not complex at all. Could anyone give me some suggestions about this?
Thanks a lot^_^
Bests, Shisheng
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] magick_2.1 tesseract_4.1 DO.db_2.9 AnnotationDbi_1.46.0 IRanges_2.18.1 S4Vectors_0.22.0
[7] Biobase_2.44.0 BiocGenerics_0.30.0
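One way to see where the engine struggled is to list the low-confidence rows together with their bounding boxes. The sketch below is only an illustration: it rebuilds a few rows copied from the output in this post as a plain data frame, the cutoff of 60 is an arbitrary assumption, and the column names `x1`, `y1`, `x2`, `y2` are hypothetical labels for the four comma-separated `bbox` values.

```r
# A few rows copied from the ocr_data() output posted above.
res <- data.frame(
  word = c("Hippo", "FAT", "1/2/3/4", "TAOK", "an", "TEAD2"),
  confidence = c(95.1, 16.7, 16.7, 18.4, 33.0, 91.9),
  bbox = c("205,207,572,344", "1237,522,1427,601", "1426,515,1762,631",
           "2037,514,2314,630", "1795,1975,2077,2200", "1311,2624,1671,2704"),
  stringsAsFactors = FALSE
)

# Assumption: 60 is an arbitrary cutoff for "low confidence".
low <- res[res$confidence < 60, ]

# Split each bbox string into four numeric columns (hypothetical names).
coords <- do.call(rbind, lapply(strsplit(low$bbox, ","), as.numeric))
colnames(coords) <- c("x1", "y1", "x2", "y2")
low <- cbind(low[, c("word", "confidence")], coords)

print(low)
```

The resulting coordinates point at the regions of the picture that may be worth inspecting or cropping and re-running on their own.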
We only provide R bindings. To discuss the OCR engine itself, you need to open an issue upstream at https://github.com/tesseract-ocr/tesseract.
Maybe you can pass control parameters to improve your results? Have you looked into tesseract_params()?
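As a rough sketch of what passing control parameters could look like: `tesseract_params()` accepts a filter string, which narrows the long list to a few entries, and `tesseract()` takes an `options` list. The choice of `tessedit_pageseg_mode = 11` (Tesseract's "sparse text" mode) is an assumption about what might suit a diagram with scattered labels; nothing in this thread confirms that it helps for this picture.

```r
library(tesseract)

# Narrow the long parameter list to entries matching a keyword.
tesseract_params("pageseg")

# Assumption: page segmentation mode 11 ("sparse text") may suit a diagram
# whose labels are scattered rather than laid out in paragraphs.
sparse_opts <- list(tessedit_pageseg_mode = 11)
eng_sparse <- tesseract("eng", options = sparse_opts)

# Same file name as in the original post; the guard makes this a no-op
# when the image is not in the working directory.
img <- "path_ziji_600.png"
if (file.exists(img)) {
  results_sparse <- ocr_data(img, engine = eng_sparse)
  print(results_sparse)
}
```

Comparing `results_sparse` with the original output row by row would show whether the low-confidence words improve.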
Thanks, Jeroen. I have checked the parameters in tesseract_params(), but there are too many of them. I think the black/grey boxes in this picture affect the accuracy.
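If the shaded boxes are the problem, one thing to try is preprocessing the image with magick (already attached in the session above) before running OCR. The sketch below is untested against this picture: `prepare_for_ocr()` is a hypothetical helper, not part of either package, and the idea that inverting the image helps with text inside dark boxes is an assumption.

```r
library(magick)
library(tesseract)

# Hypothetical helper (not part of either package): convert the image to
# grayscale, optionally invert it, and write the result to a temporary PNG.
prepare_for_ocr <- function(path, negate = FALSE) {
  img <- image_read(path)
  img <- image_convert(img, type = "Grayscale")
  if (negate) {
    # Assumption: light text on dark boxes may read better once inverted.
    img <- image_negate(img)
  }
  out <- tempfile(fileext = ".png")
  image_write(img, path = out, format = "png")
  out
}

# Same file name as in the original post; skipped when the file is absent.
img <- "path_ziji_600.png"
if (file.exists(img)) {
  eng <- tesseract("eng")
  print(ocr_data(prepare_for_ocr(img), engine = eng))
  print(ocr_data(prepare_for_ocr(img, negate = TRUE), engine = eng))
}
```

Running both variants and comparing the confidence column with the original output would show whether either step makes a difference for the words inside the boxes.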