Janneke van der Zwaan

52 comments by Janneke van der Zwaan

The software provided is (very) experimental. The README specifies the installation process, and I documented as much as possible (for example about what the input data should look like (see...

* RETAS - Text alignment software and evaluation dataset - email to obtain - http://ciir.cs.umass.edu/downloads/ocr-evaluation/

OCR text, but no gold standard: https://github.com/marriott-library/collections-as-data

Thanks! The signature of wf.list_steps() changed, so, yes, you should do print(wf.list_steps()). Please note that the workflow is about preprocessing the vudnc data, this has nothing to do with the...

Unfortunately, ochre is not (yet) fit for training good OCR post-correction models. I plan to work on it in the future, but only as a hobby project. So no promises...

I think the workflow fails because of changes to nlppln. I'll try to see if I can fix that later. Also, I really recommend using a different dataset than...

Okay, it should work again. Be careful to read the updated documentation in the README. Also, don't forget to update nlppln. For future reference, this is the relevant commit: 9ee6d7cca72bb9bcd074e1843b12ceea122662ce

Actually, the chars are extracted from all text (train set, test set, and val set). Whether this is correct (fair) is open for discussion. It is probably more correct to...
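To make the fairness question concrete, here is a minimal sketch that compares a character set built from all splits with one built from the training split only. The function name `build_char_vocab` and the toy texts are hypothetical and are not taken from ochre; this only illustrates the difference, not how ochre extracts its characters.

```python
def build_char_vocab(texts):
    """Return a sorted list of the distinct characters found in texts."""
    chars = set()
    for text in texts:
        chars.update(text)
    return sorted(chars)


# Hypothetical toy splits, for illustration only.
train = ["the cat", "a dog"]
val = ["the cot"]
test = ["a dóg"]

# Character set built from all text (train, val, and test).
vocab_all = build_char_vocab(train + val + test)

# Character set built from the training split only.
vocab_train = build_char_vocab(train)

# Characters a model would never see if only the training split were used.
unseen = sorted(set(vocab_all) - set(vocab_train))
print(unseen)
```

With a train-only character set, any character listed in `unseen` would need some fallback, such as an unknown-character symbol.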

The problem probably has to do with the fact that edlib expects a string instead of bytes. What version of Python are you using (edlib works best under Python 3)....
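If the cause is indeed bytes versus strings, one possible workaround is to decode the input before it is handed to edlib. The sketch below is only an illustration: the helper `to_text` is hypothetical, and UTF-8 is an assumed encoding that may not match the actual data.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes.

    The default encoding (UTF-8) is an assumption about the input data.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical OCR and gold standard strings, for illustration only.
ocr = to_text(b"Tlie quick brown fox")
gold = to_text("The quick brown fox")

# Both values are now str. With edlib installed, they could be aligned with
# something like: edlib.align(ocr, gold, mode="NW", task="path")
print(type(ocr).__name__, type(gold).__name__)
```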

Probably this encoding fix was only necessary for Python 2.7 (which I still use). Thank you!