Check for already OCRed and uploaded books

Open ravidreams opened this issue 9 years ago • 5 comments

As more people start using this tool, we are seeing two people try to upload the same book, or people uploading books that already exist. This results in two issues:

  1. Unnecessary processing time, after which the user wonders why their edits are not visible.
  2. Some pages are overwritten when there are small differences in the OCR output from Google. Srikanth Lakshmanan faced this issue. Since the variance is minor and the new output is not always better, it is not worth overwriting the pages with new OCR text.

I suggest that before do_ocr.py runs, it should first check whether a page like Page:Book-name.pdf/1 exists. If it exists, the tool should inform the user and suggest trying another book.
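A minimal sketch of such a check, assuming the standard MediaWiki `action=query` API, which flags a nonexistent title with a `missing` key. The function names, the `api_url` parameter, and the User-Agent string are hypothetical and not part of do_ocr.py; the sketch also assumes the canonical `Page:` namespace prefix resolves on the target wiki.

```python
import json
import urllib.parse
import urllib.request


def page_title(filename, page_number):
    """Build a Page-namespace title such as 'Page:Book-name.pdf/1'."""
    return "Page:{}/{}".format(filename, page_number)


def exists_in_response(response):
    """Return True if any page in an action=query reply is not flagged missing or invalid."""
    pages = response.get("query", {}).get("pages", {})
    return any(
        "missing" not in page and "invalid" not in page
        for page in pages.values()
    )


def page_exists(api_url, filename, page_number):
    """Ask the wiki whether the given page of the book already exists.

    api_url is an assumption, e.g. the api.php endpoint of the target Wikisource.
    """
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": page_title(filename, page_number),
        "format": "json",
    })
    request = urllib.request.Request(
        api_url + "?" + params,
        # Hypothetical User-Agent; use one that identifies your own tool.
        headers={"User-Agent": "ocr-precheck-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as reply:
        return exists_in_response(json.load(reply))
```

do_ocr.py could call something like `page_exists(api_url, "Book-name.pdf", 1)` before starting and stop with a message when it returns True.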

ravidreams avatar Feb 13 '16 18:02 ravidreams

@ravidreams please consider the other Indic Wikisources regarding this issue. Most of the other Wikisources have started manual proofreading from pages 1, 2, 3... but have not completed it. As of now, none of them has proofreading stats of zero. So checking page 1 is OK for TAWS, but not for all the other wikis.

You need to coordinate with each other. In BNWS, three of us are working on OCR, and we have discussed among ourselves who will do which set.

jayantanth avatar Feb 14 '16 03:02 jayantanth

@jayantanth In the future, the tool can be used by anyone, for any book, on any language Wikisource. So constant coordination is not possible, and the tool should eliminate manual errors.

Will it be OK if we check for the last page of the book instead of the first page? But then this check can happen only once the book is downloaded and sliced into single pages. So it should be done before OCR starts, as OCR is the time-consuming part. The tool can still give an option to continue with OCR if the user is sure about what (s)he is doing.

Or, please suggest any other logic which will avoid duplicate effort and overwriting of existing OCRed pages.
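One way the last-page check with an override could look, as a rough sketch only. `should_run_ocr`, its `force` flag, and the `page_exists` callable are hypothetical names for illustration; the callable stands in for whatever lookup the tool uses against the wiki.

```python
def should_run_ocr(page_count, page_exists, force=False):
    """Decide whether OCR should start for a book sliced into page_count pages.

    page_exists is any callable that takes a 1-based page number and returns
    True when that page is already on the wiki (hypothetical interface).
    Returns a (proceed, message) tuple.
    """
    if page_count < 1:
        return False, "no pages to OCR"
    # Check only the last page: a finished earlier upload would have created it.
    if not page_exists(page_count):
        return True, "last page not found on the wiki; starting OCR"
    if force:
        return True, "last page already exists; continuing because force is set"
    return False, "last page already exists; try another book or rerun with force"


# Usage with a stand-in lookup: pretend page 10 is already on the wiki.
already_uploaded = {10}
proceed, message = should_run_ocr(10, lambda n: n in already_uploaded)
print(proceed, message)
```

Running the check after slicing but before the OCR calls keeps the expensive step from starting, while `force` preserves the option to continue anyway.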

ravidreams avatar Feb 14 '16 12:02 ravidreams

Also, this check is important when the tool moves to the next step of OCRing multiple files in one go, instead of changing config.ini every time for the next book.

ravidreams avatar Feb 14 '16 12:02 ravidreams

@ravidreams Agreed with you.

jayantanth avatar Feb 14 '16 12:02 jayantanth

@ravidreams, purging the index file after OCR completion will make the file red in the list of Index pages. Users can then easily check the list of index files to see the OCR status of an Index file. (For example, in Bengali Wikisource, https://bn.wikisource.org/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:IndexPages&limit=500&offset=0&key=&order= )

Issue #74 - Purge the index file after OCR is completed
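For illustration, a purge could be requested through the MediaWiki `action=purge` API, which has to be sent as a POST. This is only a sketch: `build_purge_request`, `api_url`, and the User-Agent string are hypothetical, and the wiki may additionally require a logged-in session or apply rate limits.

```python
import urllib.parse
import urllib.request


def build_purge_request(api_url, index_title):
    """Prepare a POST request that purges one Index page.

    api_url and index_title (e.g. 'Index:Book-name.pdf') are assumptions
    supplied by the caller; nothing is sent until urlopen() is called.
    """
    data = urllib.parse.urlencode({
        "action": "purge",
        "titles": index_title,
        "format": "json",
    }).encode("utf-8")
    # Passing data makes urllib issue a POST, which action=purge requires.
    return urllib.request.Request(
        api_url,
        data=data,
        headers={"User-Agent": "ocr-purge-sketch/0.1"},  # hypothetical value
    )
```

The prepared request can then be sent with `urllib.request.urlopen(request)` once the OCR upload has finished.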

bodhisattwawiki avatar Feb 21 '16 19:02 bodhisattwawiki