openbenches.org

OCR Errors

Open edent opened this issue 7 years ago • 1 comments

Google Cloud Vision keeps inserting İ - "latin capital letter i with dot above (U+0130)" - when it should use a regular I (U+0049).

Will manually fix in database for now - but wonder whether there's a better way?

  • Fix on insert?
  • Fix in JS?
  • Parameters to send to Google?
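
On the "parameters" option: Cloud Vision's `images:annotate` request accepts an `imageContext.languageHints` list. Below is a minimal sketch of a request body that hints English. `buildVisionRequest` is a hypothetical helper name, not something in the openbenches code, and whether a hint actually stops the İ substitution is untested here. Google's documentation notes that leaving the hints empty usually gives the best results.

```javascript
// Sketch only: builds the JSON body for a Cloud Vision images:annotate call.
// buildVisionRequest is a hypothetical helper, not part of openbenches.
function buildVisionRequest(base64Image, languageHints = ["en"]) {
  return {
    requests: [
      {
        // Image bytes, base64-encoded
        image: { content: base64Image },
        // Ask for OCR of the inscription
        features: [{ type: "TEXT_DETECTION" }],
        // Assumption: an English hint might discourage Turkish-style İ output
        imageContext: { languageHints: languageHints },
      },
    ],
  };
}
```

The returned object would be serialised with `JSON.stringify` and sent as the POST body of the request.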

edent, Oct 05 '18 07:10

Automatically detected text is shown below. Please check and edit if needed.

Looking at https://github.com/edent/openbenches/blob/2c7a5ed04b46e21a3e38ccc7e842f06cc49f0f44/www/libs/vision/vision.js#L95 there's a line break before "Please" but it's not displayed. (History of the file shows I added that \n. Apparently I didn't notice that it has no effect on the displayed text. :/ ) Could make that line break actually display on the webpage and make the "Please check" sentence bold, and maybe more people would do so.

Fixing on insert would mean that if a submitter deliberately put İ in the text, because there is an İ in the text, then what the submitter entered would be silently changed before being saved. Which would be bad.

Could take the stance that it is so unlikely an inscription will contain İ that the detection of such should always be considered an error. Take what comes back from Google Cloud Vision and replace any instances of İ with I before showing it to the submitter. In the unlikely event the inscription really contains İ, the submitter can fix it manually or someone else can later. (Or if you want a challenge, do that but only replace İ with I if geotags indicate the bench is not in a country where Turkish or Azerbaijani is widely used. https://en.wikipedia.org/wiki/Dotted_and_dotless_I :D )
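
A rough sketch of that replace-before-display idea, including the optional country check. The function name, the `countryCode` parameter, and the country list are all assumptions for illustration. Nothing here is taken from the existing `vision.js`, and it presumes a country code has already been derived from the geotag by some other means.

```javascript
// Hypothetical helper; not taken from the openbenches codebase.
// Assumed list of countries (ISO 3166-1 alpha-2) where İ is a legitimate letter.
const DOTTED_I_COUNTRIES = new Set(["TR", "AZ"]);

// Replace İ (U+0130) with a plain I (U+0049) in OCR output,
// unless the bench is known to be somewhere İ is plausibly genuine.
function normaliseDottedI(ocrText, countryCode) {
  if (countryCode && DOTTED_I_COUNTRIES.has(countryCode.toUpperCase())) {
    return ocrText; // leave the text untouched
  }
  return ocrText.replace(/\u0130/g, "I");
}
```

Calling `normaliseDottedI(detectedText)` on the Cloud Vision result before it goes into the editable text box would still leave the submitter free to put an İ back if the inscription really has one.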

If https://developer.mozilla.org/en-US/docs/Web/CSS/::spelling-error was supported by anything at all then maybe spelling issues detected by the browser, which the presence of İ would create, could be made more obvious, thus maybe increasing the chance of the submitter correcting them.

arizonagroovejet, Oct 09 '18 20:10

This appears to have been resolved - OCR is much higher quality now. There will always be errors, but they'll have to be manually caught.

edent, May 23 '23 12:05