Full Latin support?

Open richarddd opened this issue 3 months ago • 1 comments

Hi,

First of all, great work!

Is it possible to use ocrs to detect extended ASCII characters? Swedish, Polish, etc. all use the Latin script but have more characters than the 97 allowed. Passing a custom alphabet with those characters just causes an error:

Error: model output had unexpected type or shape: output column count (97) does not match alphabet size (96)

Oct 18 '25 13:10 richarddd

The custom alphabet option in this library is tied to the text recognition model being used. To recognize additional characters you need to find or create a custom model and use an alphabet which matches. I'm not aware of anyone who has already done this for the languages you mentioned.

The recognition model predicts a probability distribution over N + 1 classes for each horizontal position in a text line, where N is the alphabet size. The original alphabet is defined here in the model training repo. If the training data set already includes examples of the characters you want to recognize, it might be enough to just change the alphabet used in the training process. If not, you'd have to expand the data set with new images that do have these characters.
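To make the N + 1 relationship concrete, here is a minimal sketch in plain Python, not the actual ocrs code. It assumes the extra class is a CTC-style "blank" stored at index 0, with alphabet character `i` at output column `i + 1`. The function names `check_alphabet` and `greedy_decode` are hypothetical and exist only for illustration.

```python
def check_alphabet(num_output_columns, alphabet):
    """Raise if the model's output width does not equal len(alphabet) + 1."""
    expected = len(alphabet) + 1  # one extra class, assumed to be the blank
    if num_output_columns != expected:
        raise ValueError(
            f"model has {num_output_columns} output columns, "
            f"but the alphabet needs {expected}"
        )


def greedy_decode(scores, alphabet, blank=0):
    """Greedy CTC-style decoding of per-position class scores.

    scores: list of rows, one per horizontal position in the text line,
            each row holding len(alphabet) + 1 scores.
    blank:  index of the blank class (assumed to be 0 in this sketch).
    """
    if not scores:
        return ""
    check_alphabet(len(scores[0]), alphabet)

    chars = []
    prev = blank
    for row in scores:
        # Pick the highest-scoring class at this position.
        best = max(range(len(row)), key=row.__getitem__)
        # Skip blanks and collapse consecutive repeats of the same class.
        if best != blank and best != prev:
            chars.append(alphabet[best - 1])
        prev = best
    return "".join(chars)


# Toy example: a 3-character alphabet needs a model with 4 output columns.
toy_alphabet = "abå"
toy_scores = [
    [0.1, 0.8, 0.05, 0.05],  # 'a'
    [0.1, 0.8, 0.05, 0.05],  # 'a' repeated, collapsed
    [0.9, 0.05, 0.03, 0.02],  # blank
    [0.1, 0.8, 0.05, 0.05],  # 'a' again, kept because a blank separated it
    [0.0, 0.0, 0.0, 1.0],  # 'å'
]
decoded = greedy_decode(toy_scores, toy_alphabet)
```

Under these assumptions, adding characters to the alphabet without retraining leaves the model's output width unchanged, so the width check fails. This is why the alphabet and the model have to change together.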

Oct 18 '25 15:10 robertknight