Feature Request: Get all characters with confidence >x

Open: billdenney opened this issue 5 years ago • 1 comment

This is related to #8 and #39 (or more accurately, the underlying ideas within them).

Because the whitelist and blacklist are not implemented upstream in tesseract 4 (discussed in #39), it is difficult to extract all-numeric values. More generally, I have some text that follows a very rigid format: columns of person identifiers (a mix of alphanumeric and dash characters) and floating-point numbers. The allowed characters for the person identifiers are hard to restrict, but the floating-point numbers are easy because they only use characters from the set 0-9, ".", and "-".

Is it possible within the ocr_data() function to get a vector of all characters that matched with >x confidence and the confidence values of those characters (where x is input by the user)?

That way, I could manually implement whitelist or blacklist functionality.

billdenney, Apr 08 '19 09:04
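
The manual filtering described in the request can be sketched in base R. This is only an illustration under stated assumptions: `filter_ocr` is a hypothetical helper, not part of the tesseract package, and the data frame below is made-up stand-in data shaped like `ocr_data()` output (a `word` column and a `confidence` column). Note that `ocr_data()` reports confidence per word rather than per character, so the sketch filters whole words.

```r
# Hypothetical helper (not part of the tesseract package): keep rows whose
# confidence exceeds `min_conf` and whose text only uses characters from
# `allowed`.
filter_ocr <- function(ocr, min_conf, allowed = "0123456789.-") {
  allowed_chars <- strsplit(allowed, "", fixed = TRUE)[[1]]
  # Split each word into single characters and check them against the set.
  only_allowed <- vapply(
    strsplit(ocr$word, "", fixed = TRUE),
    function(chars) all(chars %in% allowed_chars),
    logical(1)
  )
  ocr[ocr$confidence > min_conf & only_allowed, , drop = FALSE]
}

# Made-up stand-in for ocr_data() output, for illustration only.
example <- data.frame(
  word = c("12.5", "AB-01", "-3.07", "l2.5"),
  confidence = c(96.1, 91.4, 88.0, 42.3),
  stringsAsFactors = FALSE
)

kept <- filter_ocr(example, min_conf = 80)
print(kept)
```

Only rows that pass both the confidence threshold and the character-set check remain, which approximates a whitelist applied after recognition rather than during it.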

The whitelist / blacklist options are now supported in tesseract 4.1.

jeroen, Jul 25 '19 20:07
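
For readers landing here, a minimal sketch of how the whitelist option might be combined with `ocr_data()`. It assumes the R tesseract package is installed and linked against Tesseract 4.1 or later; `read_numeric_words`, `path`, and `min_conf` are hypothetical names used only for illustration.

```r
# Sketch only: requires the tesseract R package built against Tesseract >= 4.1.
# `read_numeric_words` is a hypothetical wrapper, not a package function.
read_numeric_words <- function(path, min_conf = 0) {
  # Engine restricted to the characters that can appear in the numeric columns.
  numbers <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_char_whitelist = "0123456789.-")
  )
  result <- tesseract::ocr_data(path, engine = numbers)
  # Optionally drop low-confidence words as well.
  result[result$confidence > min_conf, , drop = FALSE]
}
```

Calling `read_numeric_words("scan.png", min_conf = 80)` on an image of your own (the file name here is a placeholder) would return the recognized words with confidence above 80.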