Add script to evaluate performance on SROIE dataset
Related to the discussion in https://github.com/robertknight/ocrs/issues/43, this adds a script to evaluate on the SROIE 2019 dataset (scanned receipts). I wanted an end-to-end evaluation and needed the executable, so it seemed easier to put it here rather than in https://github.com/robertknight/ocrs-models.
Feel free to close, though; I was mostly curious about the results.
To run this script:

- Install dependencies: `pip install scikit-learn datasets tqdm` (I saw there are some metrics in ocrs-models, but for text vectorization it seemed easier to use scikit-learn)
- Optionally install pytesseract + tesseract
- Run `python tools/evaluate-sroie.py`, which produces (on the first 100 of ~230 images):
Evaluating on SROIE 2019 dataset...
- ocrs: 1.45 s / image, precision 0.96, recall 0.84, F1 0.90
- Tesseract: 0.84 s / image, precision 0.36, recall 0.34, F1 0.35
The precision and recall scores are computed globally on the text extracted from the image, after tokenizing with scikit-learn's vectorizer.
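To make the metric concrete, here is a rough sketch of how a global token-level precision/recall can be computed. This is an illustration, not the exact script: it approximates scikit-learn's default `CountVectorizer` tokenization (lowercased words of two or more word characters) with the equivalent regex, and the argument names are placeholders.

```python
import re
from collections import Counter

# Approximates scikit-learn's default CountVectorizer token pattern:
# lowercase tokens of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Return a multiset (Counter) of tokens in the text."""
    return Counter(TOKEN_RE.findall(text.lower()))

def precision_recall_f1(predicted_text, ground_truth):
    pred = tokenize(predicted_text)
    true = tokenize(ground_truth)
    # Tokens present in both, respecting multiplicity.
    overlap = sum((pred & true).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(true.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```

For example, `precision_recall_f1("total 12.00 eur", "TOTAL 12.00 EUR thanks")` gives precision 1.0 and recall 0.8, since all four predicted tokens appear in the five ground-truth tokens.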
So overall the scores look quite good! I'm not sure I'm using Tesseract correctly, though; its performance looks pretty bad on this dataset. Or maybe it needs some pre-processing.
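On the pre-processing point: a common step before handing scanned receipts to Tesseract is global binarization, e.g. Otsu's method. This is not part of the PR, and in practice one would use OpenCV or Pillow; below is just a stdlib sketch of the technique on a flat list of 8-bit grayscale pixel values.

```python
def otsu_threshold(pixels):
    """Otsu's global threshold for 8-bit grayscale values (0-255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0
    weight_bg = 0
    best_thresh, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Maximize between-class variance.
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_thresh = var_between, t
    return best_thresh

def binarize(pixels, thresh):
    """Map each pixel to pure black or white around the threshold."""
    return [255 if p > thresh else 0 for p in pixels]
```

For a bimodal image (e.g. dark ink on light paper) the threshold lands between the two modes, which tends to clean up the faded, low-contrast scans typical of receipt datasets.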
Run time is a bit slower than Tesseract's, but I imagine that could be improved somewhat.