tesseract
Feature Request: Get all characters with confidence >x
This is related to #8 and #39 (or more accurately, the underlying ideas within them).
Because the whitelist and blacklist are not implemented upstream in tesseract 4 (discussed in #39), it is difficult to extract all-numeric values. More generally, I have some text that follows very rigid formatting, with columns of person identifiers (a mix of alphanumeric and dash characters) and floating point numbers. The set of valid characters for the person identifiers is hard to restrict, but the floating point numbers are easy, as they come from the set 0-9, ".", and "-".
Is it possible within the ocr_data() function to get a vector of all characters that matched with >x confidence and the confidence values of those characters (where x is input by the user)?
That way, I could manually implement whitelist or blacklist functionality.
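To illustrate the kind of post-filtering meant here, below is a rough sketch in R. It works at the word level rather than the character level, since ocr_data() returns one row per word with `word`, `confidence` and `bbox` columns. The data frame is mocked so the snippet runs without an image, and `filter_confident()`, the threshold of 80 and the sample values are made up for the example, not part of the package.

```r
# Mock of the word-level table that ocr_data() returns
# (columns: word, confidence, bbox). All values are invented.
words <- data.frame(
  word = c("AB-1234", "12.5", "-3.07", "l2.S"),
  confidence = c(91.2, 96.4, 88.0, 41.5),
  bbox = c("10,10,80,30", "100,10,140,30", "160,10,210,30", "230,10,270,30"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose confidence exceeds the threshold.
filter_confident <- function(ocr_words, min_confidence) {
  ocr_words[ocr_words$confidence > min_confidence, , drop = FALSE]
}

confident <- filter_confident(words, 80)

# Manual "whitelist": keep tokens made only of digits, "." and "-".
is_numeric_token <- grepl("^[-0-9.]+$", confident$word)
numeric_tokens <- confident$word[is_numeric_token]
print(numeric_tokens)
```

With real input, `words` would be replaced by the result of ocr_data() on the scanned page. Confidence there is reported per word, so a single misread character can still hide inside a high-confidence word.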
The whitelist / blacklist options are now supported in tesseract 4.1.
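For reference, a minimal sketch of restricting recognition to the numeric character set from the original request, assuming the R package is built against tesseract 4.1 or later. `tessedit_char_whitelist` is the engine parameter for this. The image path `table.png` is a placeholder, and the OCR call is skipped if the package or the file is missing.

```r
# Characters allowed for the floating point columns: digits, "." and "-".
numeric_options <- list(tessedit_char_whitelist = "0123456789.-")

# Placeholder path; replace with a real scan.
image_path <- "table.png"

if (requireNamespace("tesseract", quietly = TRUE) && file.exists(image_path)) {
  # Engine restricted to the whitelisted characters.
  numbers_engine <- tesseract::tesseract(language = "eng", options = numeric_options)
  # Word-level results with confidence values.
  result <- tesseract::ocr_data(image_path, engine = numbers_engine)
  print(result)
}
```

The whitelist applies to the whole image, so for a page that mixes identifiers and numbers it may be easier to crop the numeric columns first and run the restricted engine on those only.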