tesstrain
Feat/generate trainingsets
Add generation of training data sets from OCR formats such as ALTO V3, PAGE 2013 and PAGE 2019, together with image files (tif, jpeg).
I've tested it now, unit tests pass and I managed to extract image-text pairs from the kant_aufklaerung_1784 sample in assets:
$ python3 ./generate_sets.py -d ../assets/data/kant_aufklaerung_1784/data/OCR-D-GT-PAGE/PAGE_0017_PAGE.xml -i ../assets/data/kant_aufklaerung_1784/data/OCR-D-IMG/INPUT_0017.tif
[SUCCESS] created '20' training data sets, please review
It would be useful to make `-o` required, or at least to print the output directory as part of the SUCCESS message.
Could the `-i` argument be optional and, by default, be derived from `imageFilename` (PAGE) / `sourceImageInformation/fileName` (ALTO)?
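A minimal sketch of how such a default could be derived, not the PR's implementation. It assumes the image reference sits in the `imageFilename` attribute of the PAGE `Page` element or in the ALTO `fileName` element, and that a relative reference resolves against the directory of the XML file; `guess_image_path` and `local_name` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def guess_image_path(xml_path: str) -> Optional[Path]:
    """Return the image path referenced by a PAGE or ALTO file, or None.

    Assumption: relative references are resolved against the XML file's
    directory, which may not hold for every workspace layout.
    """
    root = ET.parse(xml_path).getroot()
    for elem in root.iter():
        name = local_name(elem.tag)
        # PAGE: <Page imageFilename="...">
        if name == "Page" and elem.get("imageFilename"):
            ref = elem.get("imageFilename")
            break
        # ALTO: <sourceImageInformation><fileName>...</fileName>
        if name == "fileName" and elem.text and elem.text.strip():
            ref = elem.text.strip()
            break
    else:
        # No image reference found; the caller still has to pass -i.
        return None
    return Path(xml_path).parent / ref
```

With a helper like this, `-i` would only be needed when the function returns `None` or the derived path does not exist.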
We also need a section on at least the CLI usage in the README.md
For the Arabic text that is included as text resource (288652) and that is causing trouble with bidi, please see the original image (binarized).
@kba Do you know of any Devanagari or any other Indic language datasets in PAGE XML format? I only have scanned page images and their ground truth in text format. I don't think those will work with this PR.
Sorry, I do not. But maybe you have OCR results in Devanagari to test the mechanics of this PR? What problems do you foresee with Devanagari?
What problems do you foresee with Devanagari?
I don't foresee any, but wanted to test with complex scripts, just in case there is any difference in processing.
maybe you have OCR results in Devanagari to test the mechanics of this PR?
Good idea. I can test using ALTO output from tesseract.
Devanagari or any other Indic language datasets in Page XML format
I found a set of files at https://github.com/ramayanaocr/ocr-comparison/tree/master/Transkribus/Input, which has the png files as well as the xml files (generated by Transkribus, I guess). I tested with one of those files: the console messages reported success, but the files were not created. The summary option created a file, but it contained only empty lines.
tesstrain-extract-gt /home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml -i /home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png
[INFO ] generate trainingsets of '/home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml' with '/home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png' (min: 1, sum: False, reorder: False)
[SUCCESS] created '24' training data sets in 'training_data_ram110', please review
I tested with the Arabic image shared earlier in this thread and its xml file in resources, just to make sure that I had the PR installed correctly. That worked, i.e. the files were created. I haven't looked at the text within them.
tesstrain-extract-gt /home/ubuntu/tesstrain/tests/resources/xml/288652.xml -i /home/ubuntu/pagedeva/288652.png -o /home/ubuntu/pagedeva/output -s
[INFO ] generate trainingsets of '/home/ubuntu/tesstrain/tests/resources/xml/288652.xml' with '/home/ubuntu/pagedeva/288652.png' (min: 1, sum: True, reorder: False)
[SUCCESS] created '33' training data sets in '/home/ubuntu/pagedeva/output', please review
Is there a compatibility issue with transkribus generated PAGE files?
I tested just now with ALTO output from tesseract and get the following warnings:
tesstrain-extract-gt /home/ubuntu/tesstrain-San/test/iast/sandocs_2.xml -i /home/ubuntu/tesstrain-San/test/iast/sandocs_2.png -s
[INFO ] generate trainingsets of '/home/ubuntu/tesstrain-San/test/iast/sandocs_2.xml' with '/home/ubuntu/tesstrain-San/test/iast/sandocs_2.png' (min: 1, sum: True, reorder: False)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:234: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:195: RuntimeWarning: invalid value encountered in true_divide
arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
[SUCCESS] created '5' training data sets in 'training_data_sandocs_2', please review
EDIT: Earlier error with ALTO was because of typo in filename.
@Shreeshrii Thanks for pointing to PAGE files that lack `Word` elements entirely! That was the cause of the missing results in the provided Devanagari sample. I tried to fix this and added the file as a new test resource. Unfortunately, I can't judge the textual outcome, so please update the PR and have a look again ...
@M3ssman I tried just now but am getting the same result as before.
git log -3
commit 3fb94996ac42818b302850080a6f2535db12251e (HEAD -> pagesets)
Author: M3ssman <[email protected]>
Date: Sun Dec 13 10:44:47 2020 +0100
[app][fix] handle page without word elements
commit 2f3566bc23a848e3df7801b2fa1a6ce1d417e7bc
Author: M3ssman <[email protected]>
Date: Mon Dec 7 14:19:58 2020 +0100
[app][fix] filter invalid lines
commit 57ba229ace0c9ae74afb889916cba3555ef7b4d0
Author: M3ssman <[email protected]>
Date: Mon Dec 7 13:18:48 2020 +0100
[app][test] fix test imports
tesstrain-extract-gt /home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml -i /home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png -s
[INFO ] generate trainingsets of '/home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml' with '/home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png' (min: 1, sum: True, reorder: False)
[SUCCESS] created '24' training data sets in 'training_data_ram110', please review
However, only the summary file is created in 'training_data_ram110'. File is attached.
PS: I looked at the XML file and the Devanagari text in it has errors, so it is probably raw OCRed text and not corrected text for groundtruth.
I also tried with the ALTO 4.1 XML referenced in the issue I opened at https://github.com/OCR-D/ocrd_fileformat/issues/23. That fails with the following messages:
(base) ubuntu@tesseract-ocr-1:~/tesstrain-pagesets$ tesstrain-extract-gt /home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.xml -i /home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.png -s
[INFO ] generate trainingsets of '/home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.xml' with '/home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.png' (min: 1, sum: True, reorder: False)
Traceback (most recent call last):
File "/home/ubuntu/miniforge3/bin/tesstrain-extract-gt", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/cli.py", line 74, in main
reorder=REORDER)
File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 351, in create
self.xml_data, min_len=min_chars, reorder=reorder)
File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 184, in text_line_factory
ns_prefix = _determine_namespace(xml_data)
File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 223, in _determine_namespace
return [k for (k, v) in XML_NS.items() if v == root_tag][0]
IndexError: list index out of range
@Shreeshrii Thanks for pointing out ALTO V4. I had missed it because we use the latest official stable release, Tesseract 4.1, which doesn't produce this kind of ALTO data. I've added the ALTO V4 namespace declaration, and it worked fine. That surprised me somewhat, since the ALTO V4 data from OpenITI you pointed to looks quite unusual: a single String CONTENT spans a complete text line. I've never seen this before. Where does this data come from?
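The `IndexError` in the traceback above comes from indexing an empty list when the root namespace is not among the known ones. The following sketch shows one way such a lookup could fail with a clearer message. The contents of `XML_NS` here are an assumption for illustration (the project's actual mapping may differ), and `determine_namespace` is a hypothetical stand-in for `_determine_namespace`.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping for illustration; the project's XML_NS may differ.
XML_NS = {
    "alto3": "http://www.loc.gov/standards/alto/ns-v3#",
    "alto4": "http://www.loc.gov/standards/alto/ns-v4#",
    "page2013": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15",
    "page2019": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15",
}


def determine_namespace(root: ET.Element) -> str:
    """Return the known prefix for the root element's namespace URI."""
    tag = root.tag
    # ElementTree encodes namespaced tags as '{uri}localname'.
    uri = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
    for prefix, known_uri in XML_NS.items():
        if known_uri == uri:
            return prefix
    # Raise a descriptive error instead of an IndexError on an empty list.
    raise ValueError(
        f"unsupported XML namespace {uri!r}; known prefixes: {sorted(XML_NS)}"
    )
```

An unsupported format then surfaces as a readable error naming the offending namespace, rather than as `list index out of range`.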
Regarding the Devanagari issue: your git log looks fine, and the version matches. Maybe the `tesstrain-extract-gt` in your currently active environment is outdated, so please remove it and do a fresh install afterwards. You can also run `pytest -v` to execute the test cases included so far (with their test datasets) and check the temporary outputs in your local `/tmp/pytest-of-<account>` directory.
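To see which installation the active environment actually picks up, a small check along these lines may help. It is only a sketch: it assumes the console script is named `tesstrain-extract-gt` and the package `generate_sets`, as the traceback above suggests, and `locate_install` is a hypothetical helper.

```python
import importlib.util
import shutil
from typing import Optional, Tuple


def locate_install(command: str, module: str) -> Tuple[Optional[str], Optional[str]]:
    """Return the script found on PATH and the file the module loads from."""
    script = shutil.which(command)
    try:
        spec = importlib.util.find_spec(module)
    except (ImportError, ValueError):
        spec = None
    origin = spec.origin if spec is not None else None
    return script, origin


if __name__ == "__main__":
    script, origin = locate_install("tesstrain-extract-gt", "generate_sets")
    print("script on PATH:", script)
    print("module loaded from:", origin)
```

If the two paths point into different environments, the script on `PATH` is probably a leftover from an earlier install.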
pytest -v
============================================================================================================ test session starts ============================================================================================================
platform linux -- Python 3.7.6, pytest-6.2.0, py-1.10.0, pluggy-0.13.1 -- /home/ubuntu/miniforge3/bin/python3.7
cachedir: .pytest_cache
rootdir: /home/ubuntu/tesstrain-pagesets
collected 8 items
tests/test_generate_sets.py::test_create_sets_from_alto_and_tif PASSED [ 12%]
tests/test_generate_sets.py::test_create_sets_from_page2013_and_jpg PASSED [ 25%]
tests/test_generate_sets.py::test_create_sets_from_page2013_and_jpg_no_summary PASSED [ 37%]
tests/test_generate_sets.py::test_create_sets_from_page2019_and_png PASSED [ 50%]
tests/test_generate_sets.py::test_create_sets_from_ocrd_workdspace PASSED [ 62%]
tests/test_generate_sets.py::test_create_sets_from_ocrd_workdspace_fails PASSED [ 75%]
tests/test_generate_sets.py::test_handle_invalid_coords PASSED [ 87%]
tests/test_generate_sets.py::test_handle_page_devanagari_with_texlines PASSED [100%]
============================================================================================================= warnings summary ==============================================================================================================
tests/test_generate_sets.py::test_create_sets_from_alto_and_tif
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:234: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims)
tests/test_generate_sets.py::test_create_sets_from_alto_and_tif
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:195: RuntimeWarning: invalid value encountered in true_divide
arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
tests/test_generate_sets.py::test_create_sets_from_alto_and_tif
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================================================================================== 8 passed, 3 warnings in 12.88s =======================================================================================================
(base) ubuntu@tesseract-ocr-1:~/tesstrain-pagesets$ tesstrain-extract-gt /home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml -i /home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png -s
[INFO ] generate trainingsets of '/home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml' with '/home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png' (min: 1, sum: True, reorder: False)
[SUCCESS] created '24' training data sets in 'training_data_ram110', please review
(base) ubuntu@tesseract-ocr-1:~/tesstrain-pagesets$ ls -l training_data_ram110
total 4
-rw-rw-r-- 1 ubuntu ubuntu 24 Dec 15 04:20 ram110_summary.gt.txt
The files are generated as part of the test:
(base) ubuntu@tesseract-ocr-1:/tmp/pytest-of-ubuntu/pytest-current/test_handle_page_devanagari_wicurrent$ ls -l
total 34492
-rw-rw-r-- 1 ubuntu ubuntu 22618835 Dec 15 04:18 ram110.png
-rw-rw-r-- 1 ubuntu ubuntu 2515 Dec 15 04:18 ram110_summary.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 187 Dec 15 04:18 ram110_tl_10.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 603624 Dec 15 04:18 ram110_tl_10.tif
-rw-rw-r-- 1 ubuntu ubuntu 37 Dec 15 04:18 ram110_tl_11.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 266846 Dec 15 04:18 ram110_tl_11.tif
-rw-rw-r-- 1 ubuntu ubuntu 117 Dec 15 04:18 ram110_tl_12.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 550042 Dec 15 04:18 ram110_tl_12.tif
-rw-rw-r-- 1 ubuntu ubuntu 108 Dec 15 04:18 ram110_tl_13.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 601434 Dec 15 04:18 ram110_tl_13.tif
-rw-rw-r-- 1 ubuntu ubuntu 151 Dec 15 04:18 ram110_tl_14.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 651804 Dec 15 04:18 ram110_tl_14.tif
-rw-rw-r-- 1 ubuntu ubuntu 102 Dec 15 04:18 ram110_tl_15.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 520708 Dec 15 04:18 ram110_tl_15.tif
-rw-rw-r-- 1 ubuntu ubuntu 102 Dec 15 04:18 ram110_tl_16.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 516418 Dec 15 04:18 ram110_tl_16.tif
-rw-rw-r-- 1 ubuntu ubuntu 107 Dec 15 04:18 ram110_tl_17.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 745854 Dec 15 04:18 ram110_tl_17.tif
-rw-rw-r-- 1 ubuntu ubuntu 148 Dec 15 04:18 ram110_tl_18.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 615958 Dec 15 04:18 ram110_tl_18.tif
-rw-rw-r-- 1 ubuntu ubuntu 157 Dec 15 04:18 ram110_tl_19.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 560244 Dec 15 04:18 ram110_tl_19.tif
-rw-rw-r-- 1 ubuntu ubuntu 43 Dec 15 04:18 ram110_tl_1.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 127490 Dec 15 04:18 ram110_tl_1.tif
-rw-rw-r-- 1 ubuntu ubuntu 106 Dec 15 04:18 ram110_tl_20.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 561928 Dec 15 04:18 ram110_tl_20.tif
-rw-rw-r-- 1 ubuntu ubuntu 107 Dec 15 04:18 ram110_tl_21.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 662002 Dec 15 04:18 ram110_tl_21.tif
-rw-rw-r-- 1 ubuntu ubuntu 127 Dec 15 04:18 ram110_tl_22.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 548104 Dec 15 04:18 ram110_tl_22.tif
-rw-rw-r-- 1 ubuntu ubuntu 115 Dec 15 04:18 ram110_tl_23.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 704092 Dec 15 04:18 ram110_tl_23.tif
-rw-rw-r-- 1 ubuntu ubuntu 17 Dec 15 04:18 ram110_tl_24.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 105892 Dec 15 04:18 ram110_tl_24.tif
-rw-rw-r-- 1 ubuntu ubuntu 6 Dec 15 04:18 ram110_tl_2.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 32458 Dec 15 04:18 ram110_tl_2.tif
-rw-rw-r-- 1 ubuntu ubuntu 137 Dec 15 04:18 ram110_tl_3.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 741314 Dec 15 04:18 ram110_tl_3.tif
-rw-rw-r-- 1 ubuntu ubuntu 145 Dec 15 04:18 ram110_tl_4.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 712610 Dec 15 04:18 ram110_tl_4.tif
-rw-rw-r-- 1 ubuntu ubuntu 36 Dec 15 04:18 ram110_tl_5.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 337642 Dec 15 04:18 ram110_tl_5.tif
-rw-rw-r-- 1 ubuntu ubuntu 99 Dec 15 04:18 ram110_tl_6.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 495246 Dec 15 04:18 ram110_tl_6.tif
-rw-rw-r-- 1 ubuntu ubuntu 97 Dec 15 04:18 ram110_tl_7.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 581738 Dec 15 04:18 ram110_tl_7.tif
-rw-rw-r-- 1 ubuntu ubuntu 103 Dec 15 04:18 ram110_tl_8.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 518032 Dec 15 04:18 ram110_tl_8.tif
-rw-rw-r-- 1 ubuntu ubuntu 137 Dec 15 04:18 ram110_tl_9.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu 761348 Dec 15 04:18 ram110_tl_9.tif
-rw-rw-r-- 1 ubuntu ubuntu 22586 Dec 15 04:18 ram110.xml
How do I ensure that the latest `tesstrain-extract-gt` is being used?
The image should look like the following. However, in `/tmp/pytest-of-ubuntu/pytest-current/test_handle_page_devanagari_wicurrent`, the png file as well as the generated tifs show ??? rather than the Devanagari text from the image.
The generated gt.txt is correct (i.e. it is in Devanagari script), but the images are not.
ALTO V4 data from OpenITI you pointed out looks quite unfamiliar, having String CONTENT spanned over a complete textline. I've never seen this before. Where does this data come from?
I do not know more than the info available online. Please see https://github.com/OpenITI/RELEASE and https://zenodo.org/record/4075046#.X9hC0dgzaUk
@Shreeshrii Please note that the test images are created on the fly with a library that, out of the box, can render only a very small subset of Unicode characters, I guess only ASCII: no Arabic, Persian, Devanagari or old German Fraktur letters. This was introduced to keep the test data small and free of binary image files. It only gives you a hint as to whether the lines would match the "words".
@Shreeshrii Regarding the latest version: currently, there is only a pre-beta version (0.0.1) declared in `setup.py`. Usually this would be the place to track versioning. I do not know how to make use of repository information directly at this point. Maybe @kba can give us a hint?
@M3ssman Thanks for the explanations regarding test files.
Maybe tesstrain-extract-gt in your current, active environment is outdated, so please drop it and do a fresh install afterwards.
You were right about this.
I removed `tesstrain-extract-gt` from the `bin` directories and reinstalled it in the environment where `ocrd` is installed. It works now. All the tif and gt.txt files were created for the Transkribus Devanagari file.
The alto4.1 Persian file is also generating line images and text. (I haven't checked regarding the RTL issue yet).
This is great!! Thank you.
@Shreeshrii You're welcome!
... Sorry for the confusion regarding RTL ... In the end, it turned out that the `-r` flag aims at something different from real RTL handling, which can be done with `py-bidi`. If active, it only rearranges word tokens by their top-left corner in descending order, starting from the right margin. Therefore I renamed it to `--reorder`. It doesn't reverse characters. I had to deal with Arabic PAGE XML exported from Transkribus that had inconsistent reading orders and display artifacts, which almost drove me crazy.
Since this relies on individual coordinates for each token, I'm afraid it will have no effect on test resources like the ones gathered from OpenITI, which only have a single `String@CONTENT` value that represents an entire text line (or at least more than just one word). Reordering this way requires proper coordinates below the text line level: we can't just chop the lines and reorder tokens, since the source order of the elements of a plain text line is certainly not always reliable.
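The reordering described above can be illustrated roughly as follows. This is a sketch under assumptions, not the PR's code: `Token`, `reorder_tokens` and `line_text` are hypothetical names, and each word is assumed to carry the top-left corner of its bounding box.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    x: int  # left edge of the word's bounding box
    y: int  # top edge of the word's bounding box


def reorder_tokens(tokens: List[Token]) -> List[Token]:
    """Sort the tokens of one line by x in descending order (right margin first)."""
    return sorted(tokens, key=lambda t: t.x, reverse=True)


def line_text(tokens: List[Token], reorder: bool = False) -> str:
    """Join tokens in source order, or in coordinate order if reorder is set.

    Characters inside each token are left untouched; only the token
    sequence changes.
    """
    ordered = reorder_tokens(tokens) if reorder else tokens
    return " ".join(t.text for t in ordered)
```

A line that arrives as a single token has nothing to sort, which is why this kind of reordering has no effect on data with one `String` per text line.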
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This should not be closed. It needs review by someone familiar with RTL languages.
This pull request introduces 4 alerts when merging f3e73e47ca18d09ee6ba2ed3a5ea16b3f3c33620 into fa57d619e239694b9d4073eaf5b9150d0b4fae68 - view on LGTM.com
new alerts:
- 3 for `__init__` method calls overridden method
- 1 for 'import *' may pollute namespace
I've been talking with https://github.com/galdring, a colleague, about this review, and he is trying to find someone for us.
@M3ssman, please check `git config user.name`. Your commits use that name for the author information.
There's also a branch with the same name (`feat/generate-trainingsets`), already outdated, in this repository, which I guess @kba created to commit his extensions before I integrated them and they finally went to `ulb-sachsen-anhalt/tesstrain/tree/feat/generate-trainingsets`.
I wonder if this causes any irritations?
This pull request introduces 4 alerts when merging 21c718f6140ba366c68d9194509f93205717b705 into fa57d619e239694b9d4073eaf5b9150d0b4fae68 - view on LGTM.com
new alerts:
- 3 for `__init__` method calls overridden method
- 1 for 'import *' may pollute namespace
I wonder if this causes any irritations?
I don't think so but I deleted the branch since it is outdated as you say.
This pull request introduces 4 alerts when merging 23edc0685cd62c760849b6e288a58a7c9b991733 into fa57d619e239694b9d4073eaf5b9150d0b4fae68 - view on LGTM.com
new alerts:
- 3 for `__init__` method calls overridden method
- 1 for 'import *' may pollute namespace
This pull request introduces 4 alerts when merging ea8464bc779986d9ca9dd9d28e59f2e392c9e3ea into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com
new alerts:
- 3 for `__init__` method calls overridden method
- 1 for 'import *' may pollute namespace
This pull request introduces 4 alerts when merging 325d7942a516c3c980846459f2bcba2971aae59d into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com
new alerts:
- 3 for `__init__` method calls overridden method
- 1 for 'import *' may pollute namespace
This pull request introduces 4 alerts when merging cf54dd9f73b94df92af177baa70a22307473fd70 into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com
new alerts:
- 3 for `__init__` method calls overridden method
- 1 for 'import *' may pollute namespace