[Feature]: Latin NLP Model

Open ch-sander opened this issue 1 year ago • 2 comments

Problem statement

Classical languages such as Latin mostly take a back seat when it comes to NLP (for obvious reasons, though).

Solution

The spaCy model LatinCy has shown how nicely a Latin NLP model can perform. Is there any effort planned towards a Latin model within this project, or any support available if a third party aims to build such a model?

Additional Context

No response

ch-sander avatar Jan 10 '24 08:01 ch-sander

Hi @ch-sander,

I think this is a very useful feature request! I had a look at the spaCy model for Latin on the Model Hub: for PoS tagging, the following repos from Universal Dependencies are used:

As far as I can see, only UD_Latin-LLCT is directly supported in Flair:

https://github.com/flairNLP/flair/blob/ddf3bb3e44f2a68b32d532ae5438d71c4125e4ab/flair/datasets/treebanks.py#L542-L562

The other datasets can easily be added to Flair (I assigned the issue to myself).
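For context on why adding them should be straightforward: Universal Dependencies treebanks are distributed as CoNLL-U files, so each dataset boils down to the same token-per-line format. Below is a minimal, library-free sketch of pulling (token, UPOS) pairs out of CoNLL-U text. It is not Flair's actual loader; the helper name `read_upos` and the sample sentence are made up purely for illustration.

```python
from typing import List, Tuple

# Hypothetical sample in CoNLL-U format: comment lines start with "#",
# each token line has 10 tab-separated columns, and a blank line ends a sentence.
SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = Gallia est divisa\n"
    "1\tGallia\tGallia\tPROPN\t_\t_\t3\tnsubj:pass\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t3\taux:pass\t_\t_\n"
    "3\tdivisa\tdivido\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def read_upos(conllu_text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (form, UPOS) pairs per sentence (illustrative helper)."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level metadata, not a token.
            continue
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        # Column 2 is the word form, column 4 is the universal PoS tag.
        current.append((cols[1], cols[3]))
    if current:
        # Handle a file that does not end with a blank line.
        sentences.append(current)
    return sentences


sentences = read_upos(SAMPLE)
print(sentences[0])
```

The same function would apply unchanged to any of the Latin treebanks, since they share the format; only the file locations differ.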

For NER, I was unfortunately not able to find the training dataset that was used for LatinCy. It should be located here, but it is currently not available. So I am pinging @diyclassics for help on NER :)

When these resources are available and integrated into Flair, it should be very easy to train models on them. For example, PoS tagging and NER models can be trained with LMs like Latin BERT as the backbone.

stefan-it avatar Jan 18 '24 01:01 stefan-it

This sounds awesome! Thanks!

It would be promising to also involve https://github.com/CIRCSE and their many efforts related to the LiLa project @passarom. If I'm right, they also included more Medieval Latin than @diyclassics's model.

ch-sander avatar Jan 18 '24 09:01 ch-sander