go.tesseract
Two chars in the BoxText output
The fact that you check for this makes me think you have seen it at least a few times: the BoxText output sometimes contains two characters, like "th" or "ch", where a single rune should be. In my case, I thought it might be caused by the ASCII whitelist I was passing, but it isn't.
Currently, I am doing a nasty hack (that will have to live in a branch forever) where I split the bounding box in half and add both halves to the output, so that the number of bounding boxes matches the number of characters in the ASCII/UTF-8 text. But it feels like there might be a bug somewhere deeper behind this.
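For reference, the workaround amounts to roughly the following. This is only a minimal sketch: `Box` and `SplitMultiRune` are hypothetical names made up for illustration, not types or functions from go.tesseract, and it assumes the characters in a box are about equally wide.

```go
package main

import "unicode/utf8"

// Box is a hypothetical stand-in for one BoxText entry:
// the recognized text plus its pixel rectangle.
// It is NOT a go.tesseract type.
type Box struct {
	Text                     string
	Left, Top, Right, Bottom int
}

// SplitMultiRune returns one box per rune in b.Text.
// A box holding a single rune (or no text) is returned unchanged.
// A box holding n runes is cut into n equal-width slices, left to right.
func SplitMultiRune(b Box) []Box {
	n := utf8.RuneCountInString(b.Text)
	if n <= 1 {
		return []Box{b}
	}
	out := make([]Box, 0, n)
	width := b.Right - b.Left
	i := 0
	for _, r := range b.Text {
		out = append(out, Box{
			Text:   string(r),
			Left:   b.Left + width*i/n,
			Top:    b.Top,
			Right:  b.Left + width*(i+1)/n,
			Bottom: b.Bottom,
		})
		i++
	}
	return out
}
```

Running every box through `SplitMultiRune` and concatenating the results gives one box per rune, which is the mapping I need. The split positions are guesses, of course, since nothing here knows where the real boundary between the two glyphs is.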