print error - ICDAR2017_shared_task_workflows.ipynb
Hi guys,
I suggest changing print wf.list_steps() to print(wf.list_steps()) in the notebook ICDAR2017_shared_task_workflows.ipynb.
Also, I was not able to run cwltool ochre/cwl/ICDAR2017_shared_task_workflows. This is what I got: ochre/cwl/vudnc-preprocess-pack.cwl: error: argument --archive is required
Thanks! The signature of wf.list_steps() changed, so, yes, you should do print(wf.list_steps()).
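To illustrate the change (a minimal sketch with a hypothetical stand-in class, not ochre's actual implementation): list_steps() now returns the listing as a string instead of printing it, so the caller has to print the return value, and in Python 3 print is a function and needs parentheses anyway.

```python
# Hypothetical stand-in for ochre's workflow object, only to show the
# changed calling convention of list_steps().
class Workflow:
    def __init__(self, steps):
        self.steps = steps

    def list_steps(self):
        # New behaviour: return the listing instead of printing it.
        return "\n".join(self.steps)

wf = Workflow(["preprocess", "align", "train"])
# Old (Python 2 style, no longer works): print wf.list_steps()
print(wf.list_steps())
```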
Please note that the workflow is about preprocessing the vudnc data, this has nothing to do with the icdar 2017 shared task. Also, I do not recommend using the vudnc data, because it is very noisy. But if you do want to preprocess it anyway, you should do
cwltool ochre/cwl/vudnc-preprocess-pack.cwl --archive path/to/vudnc/archive
You are correct. I meant that I was not able to run vudnc-preprocess-pack.cwl.
For good results in English, do you recommend using the English monograph partition of ICDAR? I trained on the monograph and periodical partitions separately, but the validation accuracy and loss were not good (nor were the tests I ran).
I would like to help with some additional documentation to improve reproducibility, but I need a roadmap for how to get significant results (mainly for English documents).
Unfortunately, ochre is not (yet) fit for training good OCR post-correction models. I plan to work on it in the future, but only as a hobby project, so no promises there!
Generally speaking, OCR post-correction datasets are small. That's why I'm compiling a list of them, so they can be combined for better generalization. I don't think training on the English monograph data will give you a model that works on other data, because OCR errors tend to depend on the time period, the font, the OCR software that was used, etc.