Robert Sachunsky
> Not looked into details, but https://github.com/maxbachmann/RapidFuzz looks more like a string distance computation without any alignments. It implements its own fast Needleman-Wunsch alignment (based on Hyyrö algorithm or Wagner-Fischer)...
> the normalized Levenshtein distance, which is calculated as `1 - lev_dist / max(len1, len2)` [...] However in editdistance it should be simple...
> I am a complete noob on these OCR topics. I simply needed a fast implementation and found none, so I did build my own ;) Looks promising, will have...
IIUC returning the total length of the alignment path (i.e. insertions, deletions, substitutions, identities) is also necessary to calculate a correct (unbiased) accuracy / error rate. (Using the length of...
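To illustrate the point about normalization, here is a minimal sketch in plain Python (not the API of RapidFuzz or editdistance). The names `align_stats` and `error_rates` are hypothetical. Several optimal alignments can have different lengths, so the sketch assumes one tie-break: among all minimum-cost alignments, it reports the shortest path.

```python
def align_stats(ref, hyp):
    """Return (edit_distance, alignment_path_length) for two sequences.

    Assumption: ties between equally cheap alignments are resolved in
    favour of the shortest path (tuples compare cost first, then length).
    """
    # Each cell holds (cost, path_len) for the best alignment of two prefixes.
    prev = [(j, j) for j in range(len(hyp) + 1)]  # empty ref: insertions only
    for i, r in enumerate(ref, start=1):
        curr = [(i, i)]  # empty hyp: deletions only
        for j, h in enumerate(hyp, start=1):
            step = 0 if r == h else 1
            curr.append(min(
                (prev[j - 1][0] + step, prev[j - 1][1] + 1),  # identity or substitution
                (prev[j][0] + 1, prev[j][1] + 1),             # deletion
                (curr[j - 1][0] + 1, curr[j - 1][1] + 1),     # insertion
            ))
        prev = curr
    return prev[-1]


def error_rates(ref, hyp):
    """Compare normalization by alignment path length and by max length."""
    dist, path_len = align_stats(ref, hyp)
    longest = max(len(ref), len(hyp))
    return {
        "by_path": dist / path_len if path_len else 0.0,
        "by_max_len": dist / longest if longest else 0.0,
    }


if __name__ == "__main__":
    print(align_stats("abc", "bcd"))
    print(error_rates("abc", "bcd"))
```

For `"abc"` vs. `"bcd"` the cheapest alignment is one deletion, two identities and one insertion, so the path has four steps although both strings have length three. Dividing by the path length therefore gives a lower rate than dividing by `max(len1, len2)`.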
> The current version of tesstrain requires users to run `make tesseract-langdata` before running the training. Older versions of tesstrain did not require this additional step which explains that there...
No idea. Note that `ocrd-fork-pycocotools` is https://github.com/bertsky/cocoapi – I only added a few fixes. Meanwhile, I have merged from upstream. Please try again now (I have made a release 2.0.6.post1)....
I don't know what `in-tree-build` is. So you are saying that macOS works if you install manually? Or that you can pip install from the new src tarball on PyPI?
Ah, got it. Hard to tell from here. But could you try `python setup.py build_ext install` (which is in upstream's upstream)?
Another [thing](https://github.com/cocodataset/cocoapi/issues/473) you could try: setting `ARCHFLAGS="-arch x86_64"` during compilation.
That's strange indeed. It's not to be expected from the vanilla tesstrain rules (even the fast variant just does ConvertToInt). And the concrete wordlist looks very awkward (contains 400k fullforms,...