
Training Arabic Articles

Open yasminaaq opened this issue 4 years ago • 2 comments

I want to try to train GROBID to extract metadata from Arabic articles. I tried to generate training data as described in the documentation. However, since the model has not previously seen any articles with this structure or in this language, I assume that is why the output files are almost completely wrong. The documentation says that the XML files can't be generated from scratch and that the stream of text can't be edited. So what can be done in this case? Can I generate training data from scratch (based on the description in the documentation) and feed it to the model? (The documentation says not to do that.) My second question is: in theory, can training GROBID on non-English articles actually yield any results, especially with right-to-left languages? Thank you for your hard work.

yasminaaq avatar Sep 01 '21 13:09 yasminaaq

Hello @yasminaaq !

Thanks for the interest in GROBID.

Indeed with Arabic (like Chinese currently), we will have a mess with the labeling because there is no annotated example in this language - layout features alone are not enough.

First, you can edit the generated pre-labeled training data, but it has to be limited to moving the tags, the stream of text content must be kept untouched because it has to be synchronized with the feature information (capturing all the layout stuff) when training.
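To illustrate the rule, here is a hypothetical fragment in the style of GROBID's TEI training format (the element names follow the header model; the Arabic text is invented for the example):

```xml
<!-- Allowed: moving, adding, or removing tags around the text, e.g.
     wrapping the title line in <docTitle>/<titlePart>.
     Not allowed: editing, reordering, or deleting the text content itself,
     because it must stay synchronized with the extracted layout features. -->
<front>
    <docTitle>
        <titlePart>عنوان المقالة</titlePart>
    </docTitle>
    <byline><docAuthor>اسم المؤلف</docAuthor></byline>
</front>
```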

There are a couple of important things in the current process:

  • you need to follow the cascade of models when annotating the training data. So start with the segmentation model. When the overall segmentation is okay, start annotating the next-level models (header, full text, reference segmenter). Otherwise, the training data generated for the subsequent models will have wrong boundaries (and we can't edit the text stream).

  • you can either start from scratch (the createTrainingBlank batch method) and put the labels on the "empty" TEI XML text, or create pre-annotated training files with the existing English-dominated models (createTraining) and move/add/remove tags

  • after manually labeling a few examples (from scratch or pre-annotated), you can train a model with this new training data and then use it to generate better pre-annotated data, because working with training files from scratch is very tedious.

  • if you plan to use CRF models, you will need to extend the lexicon/dictionary files with Arabic resources:
      grobid-home/lexicon/names/names.family -> family names
      grobid-home/lexicon/names/firstname.5k -> forenames
      grobid-home/lexicon/countries/CountryCodes.xml -> country names
      grobid-home/lexicon/places/location.txt -> some locations (towns, districts, etc.)

You can have a look at the class org.grobid.core.lexicon.Lexicon, which loads these gazetteers from the file locations above.

  • For deep learning models, word embeddings for Arabic will need to be used.
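Concretely, the generate-annotate-retrain loop described above could look like the following (the jar version and all paths are placeholders; createTraining/createTrainingBlank and the Gradle train_* tasks are GROBID's standard batch entry points):

```shell
# 1. Generate training data: blank (from scratch) or pre-annotated
java -jar grobid-core/build/libs/grobid-core-0.7.2-onejar.jar \
     -gH grobid-home -dIn ~/arabic-pdfs -dOut ~/training-out \
     -exe createTrainingBlank   # or: -exe createTraining

# 2. Manually correct the tags in ~/training-out (moving tags only,
#    never editing the text stream), then move the files into the
#    model's corpus directory, e.g.:
#    grobid-trainer/resources/dataset/segmentation/corpus/

# 3. Retrain the model (here: the segmentation model)
./gradlew train_segmentation

# 4. Repeat step 1 with createTraining: the retrained model now produces
#    better pre-annotated files, so each iteration needs less manual work
```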

My second question is: in theory, can training GROBID on non-English articles actually yield any results, especially with right-to-left languages? Thank you for your hard work.

I think this is okay in theory :) more precisely:

  • latest pdfalto supports Arabic fonts with a dedicated CMAP file
  • there's a basic tokenizer for Arabic which is selected automatically by the language recognizer
  • the right-to-left direction does not really matter, because the extracted text is always normalized into the same "unidirectional" logical form, whatever its visual presentation is
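As a small illustration of the "unidirectional" point (a sketch, not GROBID code): once extracted, Arabic text is plain Unicode stored in logical reading order; direction is purely a rendering concern, which Python's standard unicodedata module can show via bidirectional categories:

```python
import unicodedata

# An Arabic word ("book"); in memory the characters are stored in
# logical (reading) order, regardless of right-to-left display.
word = "كتاب"

# Each Arabic letter carries the 'AL' (Arabic Letter) bidi category;
# the renderer, not the string itself, decides the display direction.
categories = [unicodedata.bidirectional(ch) for ch in word]
print(categories)  # every entry is 'AL'
```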

However, what I observed with patents in Arabic is that the reading order can be hard to get right at the word level, unlike other languages, where re-ordering the blocks is usually enough.

kermitt2 avatar Sep 02 '21 05:09 kermitt2

@yasminaaq for what it's worth, I'm also interested in increasing GROBID training coverage for non-Latin-script languages, right-to-left languages, and Arabic specifically. I am not a reader or speaker of Arabic, so I'm a bit nervous about introducing bad data, but I might be able to help anyway.

bnewbold avatar Sep 08 '21 18:09 bnewbold