Clemens Neudecker
> code for doing the ALTO XML parsing

Perhaps some of this code could be useful/repurposed:

* [`alto_tools.py`](https://github.com/cneud/alto-tools/blob/master/alto_tools.py)
* [`alto4pandas.py`](https://github.com/qurator-spk/mods4pandas/blob/master/qurator/mods4pandas/alto4pandas.py)
* [stylesheets to convert from ALTO to different formats](https://github.com/cneud/ocr-conversion#alto)
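For a rough idea of what such parsing involves, here is a minimal sketch using only the Python standard library. It is not taken from any of the tools linked above. It assumes the files use the ALTO v2 namespace `http://www.loc.gov/standards/alto/ns-v2#`; the function name `alto_text_lines` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the input uses the ALTO v2 namespace.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}


def alto_text_lines(xml_string):
    """Return one space-joined string per TextLine element."""
    root = ET.fromstring(xml_string)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        # Each recognised word is a String element; its text is in CONTENT.
        words = [
            string.get("CONTENT", "")
            for string in text_line.iterfind("alto:String", ALTO_NS)
        ]
        lines.append(" ".join(words))
    return lines


# Hypothetical, heavily reduced ALTO document used only for this example.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Hello"/><SP/><String CONTENT="world"/></TextLine>
    <TextLine><String CONTENT="Second"/><SP/><String CONTENT="line"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE):
        print(line)
```

Real files carry much more (coordinates, word confidence, hyphenation elements), so the linked tools are likely a better starting point than this sketch.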
Some initial input for the dataset card/README:

* The dataset was produced by the project partners in the [Europeana Newspapers](http://www.europeana-newspapers.eu/) project (2012-2014)
* A subset out of the newspaper collections...
> was there any re-ocr done in the past years

Unfortunately no. We are currently finalizing a report where we compare the old OCR quality with the performance that can...
Here is a quick mapping from Europeana Dataset IDs to content providers:

| europeana-ID | library |
|---|---|
| 9200359 | [National Library of the Netherlands](https://www.europeana.eu/en/collections/topic/18-newspapers?page=1&qf=DATA_PROVIDER%3A%22National%20Library%20of%20the%20Netherlands%22&api=fulltext) |
| 9200356 |...
> Are ALTO formats consistent across collections?

Within Europeana Newspapers, all OCR xml files are consistent, in that they are all using ALTO schema [version 2.0](https://github.com/altoxml/schema/blob/master/v2/alto-2-0.xsd).

> Info to include...
My current thinking is to allow the placement of ONE ``continue`` marker that can be set anywhere in the ``LOCATION`` column. Upon opening a TSV file that contains such a...
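To make the "only ONE marker" rule concrete, here is a minimal sketch in Python of how such a marker could be located when a file is loaded. It is an illustration, not the tool's actual code: the function name `find_continue_row`, the `TOKEN` column, and the literal marker value `continue` are assumptions.

```python
import csv
import io

MARKER = "continue"  # assumed literal marker text


def find_continue_row(tsv_text, column="LOCATION", marker=MARKER):
    """Return the 0-based data-row index holding the marker, or None.

    Raises ValueError if the marker occurs more than once, since only
    one marker is meant to be allowed per file.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    hits = [
        index
        for index, row in enumerate(reader)
        if (row.get(column) or "").strip() == marker
    ]
    if len(hits) > 1:
        raise ValueError(f"expected at most one marker, found {len(hits)}")
    return hits[0] if hits else None


# Hypothetical TSV content with a header row and three data rows.
SAMPLE_TSV = "TOKEN\tLOCATION\nBerlin\t\nist\tcontinue\ngross\t\n"

if __name__ == "__main__":
    print(find_continue_row(SAMPLE_TSV))
```

Rejecting a second marker at load time keeps the resume position unambiguous.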
> neat could also store the position at save time

Even better idea!
Because we are stuck on CentOS 7 with rather outdated NVIDIA drivers, we can currently only test up to TensorFlow
I must agree, calculation of reliable accuracy rates with wrong segmentation order is beyond the possibilities of `dinglehopper`. The sheer amount of possible segmentation classes/errors is escalating way too quickly!...
This would be very useful! Unfortunately, it will only work for ALTO, since PAGE-XML has no such provenance information and one instead has to fall back on the METS...