Robert Sachunsky

735 comments by Robert Sachunsky

> Those additional components are based on the components from a Tesseract standard model (as far as I remember on `Fraktur.traineddata`, but I'd have to check)

No, the latter word...

> Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must...

> * >10: 314248 words
> * >50: 100516 words
> * >100: 60403 words

> I will try to use this with frak2021, but also GT4HistOCR and others.

Done:...
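The cutoffs quoted above (`>10`, `>50`, `>100`) read as minimum-frequency thresholds on a word list. A minimal sketch of that kind of filtering follows; the function name `words_above` and the toy counts are hypothetical and not taken from DTA lexdb or the linked repository.

```python
from collections import Counter


def words_above(freqs, threshold):
    """Return the words whose frequency is strictly greater than `threshold`."""
    return sorted(word for word, count in freqs.items() if count > threshold)


# Hypothetical toy counts, purely illustrative (not real lexicon data)
freqs = Counter({"und": 120, "ſeyn": 55, "Thür": 11, "xyz": 3})

for threshold in (10, 50, 100):
    print(f">{threshold}: {len(words_above(freqs, threshold))} words")
```

A higher threshold gives a smaller list with fewer rare or noisy forms; which cutoff suits a given model is an empirical question.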

> In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort.
> It would be more interesting to use it...

Indeed – something went wrong. Thanks @jbarth-ubhd, I'll investigate!

Ok, I found the problem. See [new release](https://github.com/bertsky/dta-lexdb-applications/releases/tag/v0.2).

```
346632 lines
16.37 % lines with »ſ«
0.19 % lines all-UPPERCASE
132.80 % lines ambigious
```

What's with the > 100%...

> **a lot of spaces** after words(?).

Wow, I should have checked. Thanks again for being thorough @jbarth-ubhd – much appreciated! See [new release](https://github.com/bertsky/dta-lexdb-applications/releases/tag/v0.3).

> And not NFC (double counting,...

> If we want NFC? Don't know. Inserted it just because otherwise I won't notice this easily. I can remove this check.

I just checked: tesstrain does NFC on the...
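The two issues raised in this exchange, trailing spaces and non-NFC entries being counted twice, can be illustrated with a small sketch. It is not the fix applied in the linked releases (their contents are not shown here); `clean_wordlist` and the sample entries are hypothetical. It uses only `unicodedata.normalize` from the Python standard library.

```python
import unicodedata
from collections import Counter


def clean_wordlist(lines):
    """Strip surrounding whitespace, normalize to NFC, and count each form."""
    counts = Counter()
    for line in lines:
        word = unicodedata.normalize("NFC", line.strip())
        if word:  # skip lines that are empty after stripping
            counts[word] += 1
    return counts


# Hypothetical sample: precomposed ä, decomposed a + combining diaeresis,
# the same word with trailing spaces, and an empty line
raw = ["B\u00e4ume", "Ba\u0308ume", "B\u00e4ume   ", ""]
counts = clean_wordlist(raw)
print(counts)
```

Without the normalization step, the precomposed and decomposed spellings would be counted as two distinct entries although they render identically.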

[There](https://github.com/bertsky/dta-lexdb-applications/releases/tag/v0.4) we go

> Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters...