Tesseract-OCR-5-Docker
Tesseract-OCR-5-Docker :scroll:
Docker Image with latest Tesseract OCR Version 5.x.x built from sources.
The sources are pulled from the latest main branch and latest releases of the Tesseract OCR project.
Docker Hub: https://hub.docker.com/r/franky1/tesseract
Usage :hammer_and_wrench:
Pull Docker Image :keyboard:
Pull the docker image from Docker Hub:
docker pull franky1/tesseract
Run Docker Container :keyboard:
Mount your image data to the /tmp directory and run the Tesseract OCR container with the required command line options. For example, to run the container with the test image:
docker run -it -v ${PWD}/testdata:/tmp --rm franky1/tesseract \
tesseract english.png output --oem 1 -l eng
For the Tesseract command line options, please refer to the Tesseract Manual.
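The single-image command above can be repeated for a whole folder. The following is a minimal sketch, not part of this repository: `ocr_all` and the `DOCKER` variable are hypothetical names introduced here, and the loop assumes all images are PNG files in one host directory. The `-it` flags are left out because a loop usually runs without an interactive terminal.

```shell
#!/bin/sh
# Sketch only: ocr_all and DOCKER are hypothetical names for this example.
DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to preview the commands

ocr_all() {
    dir="$1"     # absolute host path that holds the images
    lang="$2"    # Tesseract language code, e.g. eng
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue      # no PNG matched the glob
        name=$(basename "$img")
        # The output base is the file name without .png; Tesseract adds .txt.
        $DOCKER run --rm -v "$dir":/tmp franky1/tesseract \
            tesseract "$name" "${name%.png}" --oem 1 -l "$lang"
    done
}
```

A call such as `ocr_all "${PWD}/testdata" eng` should leave one text file per image next to the images, since /tmp is the working directory inside the container. The path should be absolute because Docker treats a bare name in `-v` as a named volume.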
Mount more languages :speaking_head:
Test whether the languages mounted from your local subfolder tessdata are available in the Docker container.
Be aware that the mounted folder replaces the trained data directory inside the container, so the languages installed in the Docker image are hidden unless your local folder also contains them. Example here with the French language:
docker run -it -v ${PWD}/testdata:/tmp \
-v ${PWD}/tessdata:/usr/local/share/tessdata/ \
--rm franky1/tesseract
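To check for one specific language in a script, the output of `tesseract --list-langs` can be filtered. This is a minimal sketch under assumptions: `has_lang` is a hypothetical helper introduced here, and it assumes the language list is written to standard output with one language code per line.

```shell
#!/bin/sh
# Sketch only: has_lang is a hypothetical helper for this example.
# It reads the output of `tesseract --list-langs` on stdin and succeeds
# if the given language code appears as a complete line.
has_lang() {
    grep -qxF -- "$1"
}

# Possible usage (not executed here, requires Docker):
#   docker run --rm -v "${PWD}/tessdata":/usr/local/share/tessdata/ \
#       franky1/tesseract tesseract --list-langs | has_lang fra \
#       && echo "fra is available"
```

Matching whole lines avoids false positives from the header line that names the tessdata directory.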
Test the mounted languages in the Docker container with a sample image. Example here with the French language:
docker run -it -v ${PWD}/testdata:/tmp \
-v ${PWD}/tessdata:/usr/local/share/tessdata/ \
--rm franky1/tesseract \
tesseract french.jpg output --oem 1 -l fra
Alternatively, you can build a new Docker image if you want other languages, see next section.
Build Docker Image yourself :whale:
For details, have a look at the Dockerfile.
- Git clone this repo.
- Add your required languages to the languages.txt file.
- (a) Build the docker image from scratch, if you want the latest sources from the main branch.
docker build --tag tesseract .
- (b) Build the docker image from scratch, if you want a specific release version.
docker build --tag tesseract --build-arg TESSERACT_VERSION=5.0.0 .
- Run Tesseract OCR container with test image:
docker run -it --name tesseract -v ${PWD}/testdata:/tmp --rm \
tesseract tesseract english.png output --oem 1 -l eng
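Variants (a) and (b) differ only in the `TESSERACT_VERSION` build argument, so they can be wrapped in one helper. This is a minimal sketch: `build_image` and the `DOCKER` variable are hypothetical names introduced here, and the function is meant to be called from the root of the cloned repository.

```shell
#!/bin/sh
# Sketch only: build_image and DOCKER are hypothetical names for this example.
DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to preview the command

build_image() {
    version="${1:-}"   # empty: build from the main branch
    if [ -n "$version" ]; then
        # Variant (b): pin a specific release version.
        $DOCKER build --tag tesseract --build-arg TESSERACT_VERSION="$version" .
    else
        # Variant (a): latest sources from the main branch.
        $DOCKER build --tag tesseract .
    fi
}
```

Calling `build_image` builds from the main branch, while `build_image 5.0.0` pins the release used in the example above.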
Image conditions :ballot_box_with_check:
- The only currently supported target for this docker image is linux/amd64.
- Working directory for OCR images is /tmp inside the container. See example above.
- Directory for trained data is /usr/local/share/tessdata/ inside the container. See example above.
- This image was built without the Tesseract training tools.
- This image currently includes only the following languages:
  - English: tessdata_best > eng.traineddata
  - German: tessdata_best > deu.traineddata
- If you need other languages, you have to build your own image or mount trained data to the /usr/local/share/tessdata/ directory. See example above.
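Because the image targets linux/amd64 only, hosts with another CPU architecture may need Docker's `--platform` option. The sketch below is an assumption, not something this repository documents: `platform_flag` is a hypothetical helper, and running the image this way depends on emulation support being set up on the host, which is typically slower than native execution.

```shell
#!/bin/sh
# Sketch only: platform_flag is a hypothetical helper for this example.
# It prints "--platform linux/amd64" unless the machine is already amd64.
platform_flag() {
    arch="${1:-$(uname -m)}"   # optional argument overrides the detected arch
    case "$arch" in
        x86_64|amd64) ;;       # native target, no flag needed
        *) printf '%s\n' '--platform linux/amd64' ;;
    esac
}

# Possible usage (not executed here, requires Docker):
#   docker run --rm $(platform_flag) -v "${PWD}/testdata":/tmp \
#       franky1/tesseract tesseract english.png output --oem 1 -l eng
```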
Tesseract Trained Data for all available languages :weight_lifting:
- Overview of supported languages https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
- Trained models with support for legacy and LSTM OCR engine https://github.com/tesseract-ocr/tessdata
- Fast integer versions of trained LSTM models https://github.com/tesseract-ocr/tessdata_fast
- Best (most accurate) trained LSTM models https://github.com/tesseract-ocr/tessdata_best
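The repositories listed above can be used to fill a local tessdata folder for mounting. The following is a minimal sketch: `traineddata_url`, `fetch_lang`, and `TESSDATA_REPO` are hypothetical names introduced here, and the URL pattern assumes each `<lang>.traineddata` file sits at the root of the repository's main branch.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical helpers for this example.
# TESSDATA_REPO selects tessdata, tessdata_fast or tessdata_best.
TESSDATA_REPO="${TESSDATA_REPO:-tessdata_best}"

# Print the assumed download URL for a language code such as fra.
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' \
        "$TESSDATA_REPO" "$1"
}

# Download one language into ./tessdata (needs curl and network access).
fetch_lang() {
    mkdir -p tessdata
    curl -fsSL -o "tessdata/$1.traineddata" "$(traineddata_url "$1")"
}
```

Running `fetch_lang fra` before the mount example gives the container a French model. If English or German are still needed alongside it, they would have to be fetched as well, because a mounted folder replaces the languages installed in the image.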
Further documentation :link:
- Docker Hub: https://hub.docker.com/repository/docker/franky1/tesseract
- Original Tesseract Github Repository: https://github.com/tesseract-ocr/tesseract
- Original Tesseract Documentation: https://tesseract-ocr.github.io/
- Original Tesseract Manual: https://tesseract-ocr.github.io/tessdoc/
- More tessdata_best languages: https://github.com/tesseract-ocr/tessdata_best
ToDo :white_check_mark:
- [x] Update README.md to latest Dockerfile and Usage
- [x] Add dependabot on Github
- [x] Add vulnerability scanning in Github Actions with Snyk
- [ ] Add GitHub Action to check container efficiency with Dive https://github.com/MartinHeinz/dive-action
- [ ] Add documentation for GitHub Actions Workflow
- [ ] Add more inline comments in GitHub Actions related files
- [ ] Build image for more targets
- [ ] Building Tesseract with TensorFlow?
- [ ] Building Tesseract with Training tools?
- [ ] Change build in Dockerfile according to instructions in Compiling-GitInstallation.md
Issues :bug:
If you find any bugs or have requests regarding this Docker image, please open an issue in this GitHub repository.
Project status :heavy_check_mark:
27.07.2022: The Docker image is ready for use. Some slight improvements are still possible, and build issues occur occasionally.