
Prohibit do_ocr.py when ONLY .txt files remain in the root folder

Open ravidreams opened this issue 9 years ago • 4 comments

Prohibit running do_ocr.py when ONLY .txt files remain and no .upload or .log files are present in the root folder. We had one instance where a user ran do_ocr.py again after his connection was lost midway. He had already been uploading OCRed pages to WS and had some text files remaining, so he ended up overwriting existing pages. Strangely, Google returned gibberish this time: https://ta.wikisource.org/w/index.php?title=Page%3A%E0%AE%A4%E0%AE%A9%E0%AE%BF_%E0%AE%B5%E0%AF%80%E0%AE%9F%E0%AF%81.pdf%2F79&type=revision&diff=97935&oldid=96015

The user should instead be prompted to run mediawiki_uploader.py, after making sure that these pages are missing from the WS index page.
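A minimal sketch of the kind of guard being requested, not the project's actual code. The function name `should_block_ocr` and the constant `MARKER_SUFFIXES` are hypothetical, and it assumes the .upload and .log files can be recognised by their file extension. It only checks that .txt files exist while no .upload or .log file does; it does not try to verify that nothing else is in the folder.

```python
from pathlib import Path

# Assumption: marker files are identified by these extensions,
# as named in the issue text (.upload, .log).
MARKER_SUFFIXES = {".upload", ".log"}


def should_block_ocr(folder):
    """Return True when re-running OCR risks overwriting uploaded pages.

    Assumed rule: at least one .txt file is present in the folder,
    but no .upload or .log file is.
    """
    # Collect the lowercase extensions of all regular files in the folder.
    suffixes = {p.suffix.lower() for p in Path(folder).iterdir() if p.is_file()}
    has_text = ".txt" in suffixes
    has_marker = bool(suffixes & MARKER_SUFFIXES)
    return has_text and not has_marker
```

do_ocr.py could call such a check at startup and, when it returns True, exit with a message telling the user to verify the WS index page and run mediawiki_uploader.py instead.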

ravidreams avatar Feb 16 '16 10:02 ravidreams

I had the same problem just now. After do_ocr.py succeeded, it started uploading files to Wikisource. That process failed after 316 pages. I restarted do_ocr.py, thinking it would resume from where it had stopped. Instead, it started the OCR from page 1 again.

What is the recommended workflow in such cases?

Shreeshrii avatar Oct 14 '17 03:10 Shreeshrii

@Shreeshrii just run as

python mediawiki_uploader.py

This will do the upload work only.

Will fix the issue detailed by @ravidreams soon.

tshrinivasan avatar Oct 14 '17 05:10 tshrinivasan

@tshrinivasan Thanks!

Will the upload work if I create the OCRed files locally on my PC using tesseract?

Shreeshrii avatar Oct 14 '17 10:10 Shreeshrii

Currently, no.

But I can provide that as a separate script.

Raise a new issue with your detailed requirements, along with tesseract's output filename patterns.

Does tesseract support the sa language?

tshrinivasan avatar Oct 14 '17 14:10 tshrinivasan