
Failed to extract text

Open wangshisheng opened this issue 5 years ago • 2 comments

Hi all,

I tried this package to extract text from a simple picture, but the results are not as good as expected. Here is my picture at 300 dpi: path_ziji_300

library(tesseract)
eng <- tesseract("eng")
results2 <- tesseract::ocr_data("path_ziji_600.png", engine = eng)
results2
# A tibble: 18 x 3
   word            confidence bbox               
   <chr>                <dbl> <chr>              
 1 Hippo                 95.1 205,207,572,344    
 2 pathway               96.1 614,207,1122,344   
 3 DCHS1/2               88.9 468,515,919,631    
 4 FAT                   16.7 1237,522,1427,601  
 5 1/2/3/4               16.7 1426,515,1762,631  
 6 TAOK                  18.4 2037,514,2314,630  
 7 1/2/3                 18.4 2313,514,2581,630  
 8 SAV1                  82.5 1061,1156,1316,1237
 9 ||                    83.3 1345,1132,1525,1276
10 STK3/4                92.3 1573,1154,1933,1237
11 --                    55.0 2098,1288,2286,1344
12 CRB1/2                78.2 485,2040,860,2189  
13 an                    33.0 1795,1975,2077,2200
14 TEAD2                 91.9 1311,2624,1671,2704
15 Cell                  96.2 703,2995,891,3078  
16 proliferation         95.7 922,2995,1507,3101 
17 and                   96.5 1537,2995,1702,3078
18 differentiation       96.0 1734,2995,2398,3078

It does not recognize all of the text in this picture, which is a little strange because the picture is not complex at all. Could anyone give me some suggestions?
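To see at a glance which detections are unreliable, one option is to filter on the confidence column. The sketch below uses a few rows copied from the output above; `low_confidence` is a hypothetical helper, and the cutoff of 60 is an arbitrary assumption rather than a recommended value.

```r
# A few rows copied from the ocr_data() output above
results <- data.frame(
  word = c("Hippo", "FAT", "1/2/3/4", "TAOK", "an", "TEAD2"),
  confidence = c(95.1, 16.7, 16.7, 18.4, 33.0, 91.9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose confidence is below a cutoff.
# The default cutoff is an assumption and should be tuned per image.
low_confidence <- function(df, cutoff = 60) {
  df[df$confidence < cutoff, , drop = FALSE]
}

low_confidence(results)
```

The same helper should work on the full `results2` object, since it only relies on the `confidence` column.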

Thanks a lot^_^

Bests, Shisheng

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936  LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magick_2.1           tesseract_4.1        DO.db_2.9            AnnotationDbi_1.46.0 IRanges_2.18.1       S4Vectors_0.22.0    
[7] Biobase_2.44.0       BiocGenerics_0.30.0

wangshisheng, Aug 12 '19 10:08

We only provide R bindings. To discuss the OCR engine itself, you need to open an issue upstream at https://github.com/tesseract-ocr/tesseract.

Maybe you can pass control parameters to improve your results? Have you looked into tesseract_params()?
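As a rough illustration of passing a control parameter, the sketch below narrows the parameter list with the `filter` argument of tesseract_params() and then builds an engine with a different page segmentation mode. `ocr_with_psm` is a hypothetical helper, and mode 11 (sparse text) is only an assumption about what might suit a diagram with scattered labels; it has not been tested on this image.

```r
library(tesseract)

# List only the parameters whose name contains "pageseg"
tesseract_params("pageseg")

# Hypothetical helper: run OCR with a non-default page segmentation mode.
# psm = 11 ("sparse text") is an assumption, not a verified fix.
ocr_with_psm <- function(path, psm = 11) {
  engine <- tesseract("eng", options = list(tessedit_pageseg_mode = psm))
  ocr_data(path, engine = engine)
}

# Usage (file name taken from the original post):
# ocr_with_psm("path_ziji_600.png")
```

Filtering by a keyword such as "pageseg" or "whitelist" makes the long parameter list easier to browse.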

jeroen, Aug 12 '19 19:08

Thanks, Jeroen. I have checked the parameters in tesseract_params(), but there are too many. I think the black/grey boxes in this picture affect the accuracy.
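If the shaded boxes are the problem, one thing to try is preprocessing the image with magick (already attached in the session above) before running OCR. This is a minimal sketch, not a confirmed fix: `ocr_preprocessed` is a hypothetical helper, the 50% threshold is a guess, and text that ends up light on a dark background may need further handling such as inverting the image.

```r
library(magick)
library(tesseract)

# Hypothetical helper: convert to grayscale and binarize before OCR.
# The threshold is an assumption and will likely need tuning.
ocr_preprocessed <- function(path, threshold = "50%") {
  img <- image_read(path)
  img <- image_convert(img, type = "Grayscale")
  # Push pixels above the threshold to white and below it to black
  img <- image_threshold(img, type = "white", threshold = threshold)
  img <- image_threshold(img, type = "black", threshold = threshold)
  # Write to a temporary PNG so ocr_data() receives a plain file path
  tmp <- tempfile(fileext = ".png")
  image_write(img, path = tmp, format = "png")
  ocr_data(tmp, engine = tesseract("eng"))
}

# Usage (file name taken from the original post):
# ocr_preprocessed("path_ziji_600.png")
```

Comparing the confidence column before and after preprocessing would show whether the boxes are really what lowers the accuracy.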

wangshisheng, Aug 13 '19 00:08