spec Metadata for OCR models and/or OCR model training sets

We need to define a set of metadata for OCR models including at least:

engine (incl. version)
parameter setting for training
reference to OCR model training set
...

We need to define a set of metadata for OCR model training sets including at least:

information on the training materials
(output) character sets
license
...
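
The two lists above could be captured as plain records. The following is only a minimal sketch: all field names and example values are hypothetical placeholders, since the issue names the categories but not a concrete schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class TrainingSetMetadata:
    # Hypothetical field names; the issue only lists the categories.
    materials: str        # information on the training materials
    charset: List[str]    # (output) character set
    license: str


@dataclass
class ModelMetadata:
    engine: str
    engine_version: str
    # Generic key-value store for the training parameter settings.
    training_parameters: Dict[str, object] = field(default_factory=dict)
    # Reference to the training set, e.g. a URL or PID (placeholder here).
    training_set: str = ""


training_set = TrainingSetMetadata(
    materials="placeholder description of the source material",
    charset=sorted(set("abc")),
    license="CC-BY-4.0",
)
model = ModelMetadata(
    engine="example-engine",
    engine_version="0.0.0",
    training_parameters={"learning_rate": 0.001},
    training_set="https://example.org/training-sets/placeholder",
)

# Serialize both records into one machine-readable document.
record_json = json.dumps(
    {"model": asdict(model), "training_set": asdict(training_set)}, indent=2
)
print(record_json)
```

Keeping the training parameters as an open key-value mapping avoids committing to the parameter names of any one engine.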

Oct 12 '18 14:10 wrznr

@VolkerHartmann Relevant for GT repository as well as model repository. @bertsky Relevant for post correction.

Additions to metadata entries and proposals for representation format(s) very much welcome.

Oct 12 '18 14:10 wrznr

GT repository @bertsky: Which attributes will be important for the selection of GT records? I'm thinking of:

Font
Publication date
Print shop (?) (I have not seen this attribute yet but it could be helpful, couldn't it)
...

Model repository: At the moment, no collection (for a training set) can be created, so none can be referenced. This feature is planned for future versions. Until then, all pages/data have to be listed.

What will the parameters look like? (To be as generic as possible, a key-value implementation would be appropriate.)

Information on the training materials: part of the GT metadata, e.g. publishing date, language, fonts, ...

Oct 15 '18 08:10 VolkerHartmann

@VolkerHartmann Sorry, I am not so sure what it is you are asking me for. This issue is about OCR model meta-data, and I already find the list of features for that mentioned by @wrznr in the original post sufficient for post-correction purposes. Are you actually addressing #85 here? And what does "selection of GT records" refer to (the selection of features for GT meta-data records perhaps)?

Oct 15 '18 12:10 bertsky

If the list of features is sufficient, that's fine.

Oct 15 '18 13:10 VolkerHartmann

@VolkerHartmann In which format can necessary metadata be sufficiently (i.e. in a formal, machine-readable way) defined?

Nov 06 '18 08:11 wrznr

Most formats are easy to parse. I would prefer JSON or XML but key-value pairs are also ok if no hierarchy exists.
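
To illustrate the point about hierarchy: a nested record can only be written as plain key-value pairs after its structure has been encoded into the keys. A minimal sketch, assuming a dotted-key convention and made-up field names (neither is part of any agreed schema):

```python
def flatten(record, prefix=""):
    """Turn a nested dict into flat key-value pairs with dotted keys."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-records and carry the parent key as prefix.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical record; the field names are placeholders.
nested = {
    "engine": {"name": "example-engine", "version": "0.0.0"},
    "license": "CC-BY-4.0",
}
flat = flatten(nested)
for key, value in flat.items():
    print(f"{key}={value}")
```

With JSON or XML the nesting is expressed directly, which is the reason to prefer them once the metadata is no longer flat.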

Nov 06 '18 08:11 VolkerHartmann

https://github.com/kba/ocr-models/blob/master/schema/description.schema.yml

Nov 06 '18 09:11 kba

@wrznr develops a proposal based on the above schema.

Nov 06 '18 10:11 wrznr

@wrznr Push.

Nov 13 '18 09:11 wrznr

Just to let you know that I've been told today that PMML is the widely accepted standard to describe ML models. It is XML-based. Perhaps we can learn/borrow some things from there.

Nov 13 '18 16:11 cneud

https://github.com/kba/ocr-models/blob/master/schema/description.schema.yml format: hdf5, pyrnn, pronn, ...

HDF5 is a container format but not a format of the model, right? It could contain any models. Is pyrnn a widely known standard extension? I can't find any information about that. We could add PMML as a possible format.

Landing page for the model or homepage of the creator

In most cases the creator will be an algorithm.

I am missing information about the underlying font and (optionally) language variants, which is needed to select the appropriate model. In addition, I would prefer to describe the model as defined in PMML (see MODEL-ELEMENT), e.g. "NeuralNetwork", together with information on which algorithms it can be used with. (OK, KRAKEN is compatible with ocropus.) Are there other algorithms we could use later? I think the format defined in description.schema.yml links both.

If the model is described in PMML, does a consumer have to support all variants? In the future, there could be importers and exporters for different algorithms. When the time comes, we can always store the models as PMML. :-)

Nov 16 '18 07:11 VolkerHartmann

What is the status on this? I've hacked together a zenodo-based thingy that uses the metadata schema of the old repository, but that is clearly insufficient.

If we're still on the schema proposed by @kba I would suggest some additions and changes. For one adding a field pointing to a training data set (by URL or PID) is somewhat important and putting in at least a CER measurement might also be prudent.
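
For reference, a CER value for such a field is commonly computed as the Levenshtein distance between the recognized text and the ground truth, divided by the length of the ground truth. A minimal sketch, assuming that definition (the thread does not fix one) and hypothetical function names:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` measured against `reference`."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(cer("abcd", "abxd"))
```

Note that a CER defined this way can exceed 1.0 when the hypothesis is much longer than the reference, so a schema should not restrict the field to the range 0 to 1 without saying which normalization is meant.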

With regard to using PMML, I'm not sure how/if it is beneficial to describe OCR models on a functional level, as all engines come with their own format, effectively making the model files opaque blobs. A functional description also doesn't aid in any way in model selection/implementation matching.

Jan 20 '19 16:01 mittagessen

@kba @tboenig @wrznr have a meeting on this issue next week. We'll get back to you asap.

Jan 20 '19 16:01 wrznr

https://github.com/kba/mollusc/blob/master/spec/training-schema.yml

Jan 29 '19 09:01 wrznr

The repository isn't public.

Jan 29 '19 09:01 mittagessen

@mittagessen See https://github.com/OCR-D/spec/pull/105

Jan 29 '19 10:01 kba

@kba Can we involve @Doreenruirui here? She has a specification ready, right?

Apr 16 '19 07:04 wrznr

@wrznr @kba @Doreenruirui This is already pretty far along https://github.com/Doreenruirui/okralact/tree/master/docs, https://github.com/Doreenruirui/okralact/tree/master/engines/schemas, no?

May 21 '19 22:05 cneud

Hi Clemens,

Yes, the schemas are designed according to the documentation of the parameters of each engine. They are mainly used to verify the parameters when a user uploads a configuration file.

Best, Rui

May 23 '19 09:05 Doreenruirui

See https://github.com/Calamari-OCR/calamari/blob/master/calamari_ocr/ocr/datasets/dataset.py for the base class of datasets (image+transcription tuples) in calamari

Jun 17 '19 15:06 kba

I would like to restart the discussion on this, as I've got a scalable-ish model repository working but the metadata schema used right now is insufficiently powerful (both for print and manuscripts). The current state is here. It is already designed in a way to support multiple recognition engines through a free-text field in a searchable property. Each engine would define its own identifiers, ideally with different suffixes for functionally different model types, so multi- or cross-engine software would be able to effectively filter for supported models.
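
The filtering idea can be sketched as follows. The property name `model_type` and the identifier strings are hypothetical placeholders, not the identifiers actually used in the repository:

```python
# Hypothetical metadata records as a client might receive them from a search.
records = [
    {"name": "model-a", "model_type": "engineA_recognition"},
    {"name": "model-b", "model_type": "engineA_segmentation"},
    {"name": "model-c", "model_type": "engineB_recognition"},
]


def supported_models(records, supported_types):
    """Keep only records whose free-text type identifier the client can load."""
    supported = set(supported_types)
    return [r for r in records if r.get("model_type") in supported]


# A cross-engine client that can load recognition models of both engines.
usable = supported_models(records, {"engineA_recognition", "engineB_recognition"})
print([r["name"] for r in usable])
```

Because the client matches on the full identifier including the suffix, a functionally different model type of the same engine is excluded without any further inspection of the model file.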

Currently, there are two requirements missing:

proper automatic model selection support
reproducibility

My suggestion is to incorporate an opaque blob that encapsulates hyperparameters in a way that OCR engines or a third party software like okralact can re-instantiate a model from scratch. This allows us

For automatic model selection there should be the ability to encode script (already in there), transcription levels, some kind of validation/test loss/error curve(s), and references directly to training data (if publicly available) or at least source material. To incorporate the methods the FAU team have developed we should also incorporate some kind of global script type embedding. It might be advisable to allow multiple of these, as the FAU system is currently fairly specific to the material OCR-D concerns itself with while other people might have more specific embeddings.
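
As a rough illustration of what automatic selection could look like with only two of these fields, the sketch below filters by script and picks the model with the lowest validation error. The field names `script` and `val_cer`, the record values, and the selection rule itself are assumptions for illustration; script embeddings and transcription levels are not modelled here.

```python
def select_model(records, script):
    """Return the record for `script` with the lowest validation CER, or None."""
    candidates = [
        r for r in records
        if r.get("script") == script and "val_cer" in r
    ]
    return min(candidates, key=lambda r: r["val_cer"], default=None)


# Hypothetical records; scripts are given as ISO 15924 codes.
records = [
    {"name": "model-a", "script": "Latn", "val_cer": 0.04},
    {"name": "model-b", "script": "Latn", "val_cer": 0.02},
    {"name": "model-c", "script": "Arab", "val_cer": 0.01},
]

best = select_model(records, "Latn")
print(best["name"] if best else "no matching model")
```

A single scalar error is a weak basis for comparison if models were validated on different material, which is one reason to also record references to the training data or source material.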

Oct 23 '19 18:10 mittagessen
