OCR4wikisource
Prohibit do_ocr.py when ONLY .txt files remain in the root folder
Prohibit do_ocr.py when ONLY .txt files remain and .upload or .log files are NOT available in the root folder. We had one instance where a user ran do_ocr.py again after his connection was lost midway. He had already been uploading OCRed pages to WS and had some text files remaining, so he ended up overwriting existing pages. Strangely, Google returned gibberish this time: https://ta.wikisource.org/w/index.php?title=Page%3A%E0%AE%A4%E0%AE%A9%E0%AE%BF_%E0%AE%B5%E0%AF%80%E0%AE%9F%E0%AF%81.pdf%2F79&type=revision&diff=97935&oldid=96015
The user should be prompted to run mediawiki_uploader.py after making sure that these pages are missing from the WS index page.
I had the same problem just now. After do_ocr.py succeeded, it started uploading files to Wikisource. That process failed after 316 pages. I restarted do_ocr.py, thinking it would resume from where it had stopped. Instead, it started the OCR from page 1 again.
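For illustration, the requested guard could look roughly like the sketch below. This is not the project's actual code: the function name `ocr_should_be_blocked`, the `IGNORED_SUFFIXES` set, and the reading of ".upload or .log files" as file extensions are all assumptions made for this example.

```python
from pathlib import Path

# Assumption: files with these suffixes belong to the tool itself
# (scripts, config) and are not counted as work files.
IGNORED_SUFFIXES = {".py", ".ini", ".md"}
# Assumption: ".upload" and ".log" are file extensions left by a previous run.
MARKER_SUFFIXES = {".upload", ".log"}


def ocr_should_be_blocked(folder):
    """Return True when only .txt work files remain and no marker file exists."""
    suffixes = [
        p.suffix.lower()
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() not in IGNORED_SUFFIXES
    ]
    has_marker = any(s in MARKER_SUFFIXES for s in suffixes)
    work = [s for s in suffixes if s not in MARKER_SUFFIXES]
    only_text = bool(work) and all(s == ".txt" for s in work)
    return only_text and not has_marker


if __name__ == "__main__":
    if ocr_should_be_blocked("."):
        print("Only .txt files remain. Check the WS index page for missing "
              "pages, then run: python mediawiki_uploader.py")
```

A check like this would run at the start of do_ocr.py and exit before any OCR begins, so leftover text files are never overwritten by a fresh OCR pass.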
What is the recommended workflow in such cases?
@Shreeshrii just run:
python mediawiki_uploader.py
This will do the upload work only.
Will fix the issue detailed by @ravidreams soon.
@tshrinivasan Thanks!
Will the upload work if I create the OCRed files locally on my PC using tesseract?
Currently no.
But I can give you that as a separate script.
Raise a new issue with your detailed requirements, along with tesseract's output filename patterns.
Does tesseract support the sa language?