flair
[Feature]: Latin NLP Model
Problem statement
Classical languages such as Latin mostly take a back seat when it comes to NLP (for obvious reasons, though).
Solution
The spaCy-based LatinCy model has shown how well a Latin NLP model can perform. Is any work towards a Latin model planned within this project, or would there be support if a third party set out to build one?
Additional Context
No response
Hi @ch-sander ,
I think this is a very useful feature request! After having a look at the spaCy model for Latin on the Model Hub, for PoS Tagging the following repos from Universal Dependencies are used:
As far as I can see, only UD_Latin-LLCT is directly supported in Flair:
https://github.com/flairNLP/flair/blob/ddf3bb3e44f2a68b32d532ae5438d71c4125e4ab/flair/datasets/treebanks.py#L542-L562
The other datasets can easily be added to Flair (I have assigned the issue to myself).
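For anyone wondering what "adding a treebank" mostly boils down to: a rough sketch, assuming the usual Universal Dependencies file naming (`<lang>_<treebank>-ud-<split>.conllu`) and that the repository's default branch is `master`. The helper `ud_split_urls` is hypothetical and not part of Flair; it only shows how the download URLs for a UD repo could be derived.

```python
from __future__ import annotations

# Base URL for raw files in the Universal Dependencies GitHub organisation.
UD_RAW_BASE = "https://raw.githubusercontent.com/UniversalDependencies"


def ud_split_urls(repo: str, lang_code: str, branch: str = "master") -> dict[str, str]:
    """Build the raw-file URLs for the train/dev/test splits of a UD treebank.

    Hypothetical helper for illustration. `repo` is the repository name,
    e.g. "UD_Latin-LLCT"; `lang_code` is the language prefix used in the
    file names, e.g. "la". The branch name is an assumption.
    """
    # "UD_Latin-LLCT" -> "llct": the part after the first hyphen, lowercased.
    treebank = repo.split("-", 1)[1].lower()
    prefix = f"{UD_RAW_BASE}/{repo}/{branch}/{lang_code}_{treebank}-ud"
    return {split: f"{prefix}-{split}.conllu" for split in ("train", "dev", "test")}


urls = ud_split_urls("UD_Latin-LLCT", "la")
for split, url in urls.items():
    print(split, url)
```

Not every treebank ships all three splits, so a real loader would need to check which files actually exist before downloading.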
For NER, I was unfortunately not able to find the training dataset that was used for LatinCy. It should be located here, but it is currently not available. So I am pinging @diyclassics for help on NER :)
Once these resources are available and integrated into Flair, training models on them should be straightforward. For example, PoS tagging and NER models can be trained with LMs such as Latin BERT as the backbone.
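To give an idea of what such a training run might look like, here is a minimal sketch for PoS tagging on the Latin treebank already in Flair. It assumes the `UD_LATIN` corpus class and the standard Flair fine-tuning setup; the model identifier is a placeholder (not a real hub ID) that would have to be replaced with a Latin LM loadable through Hugging Face Transformers, and the hyperparameters are untuned defaults chosen for illustration.

```python
# Illustrative settings only; none of these values are tuned.
TRAINING_DEFAULTS = {
    "label_type": "upos",            # universal PoS tags from the UD treebank
    "learning_rate": 5e-5,
    "mini_batch_size": 4,
    "max_epochs": 10,
}

# Placeholder: replace with the path or hub ID of an actual Latin LM.
LATIN_LM_PLACEHOLDER = "path/to/latin-bert"


def train_latin_upos_tagger(model_name=LATIN_LM_PLACEHOLDER,
                            output_dir="resources/taggers/latin-upos"):
    """Fine-tune a transformer-based PoS tagger on Flair's Latin UD corpus."""
    # Flair is imported inside the function so that this file can be
    # loaded even when Flair is not installed.
    from flair.datasets import UD_LATIN
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    corpus = UD_LATIN()  # downloads the treebank on first use
    label_type = TRAINING_DEFAULTS["label_type"]
    label_dict = corpus.make_label_dictionary(label_type=label_type)

    # Fine-tune the LM itself; use the last layer and first-subtoken pooling.
    embeddings = TransformerWordEmbeddings(
        model=model_name,
        layers="-1",
        subtoken_pooling="first",
        fine_tune=True,
    )

    # Plain linear layer on top of the transformer (no RNN, no CRF).
    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=label_dict,
        tag_type=label_type,
        use_crf=False,
        use_rnn=False,
        reproject_embeddings=False,
    )

    trainer = ModelTrainer(tagger, corpus)
    return trainer.fine_tune(
        output_dir,
        learning_rate=TRAINING_DEFAULTS["learning_rate"],
        mini_batch_size=TRAINING_DEFAULTS["mini_batch_size"],
        max_epochs=TRAINING_DEFAULTS["max_epochs"],
    )
```

Calling `train_latin_upos_tagger(model_name=...)` with a real model would start the run. An NER model would follow the same pattern with a NER corpus and `label_type="ner"`, once a suitable dataset is available.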
This sounds awesome! Thanks!
It would also be worth involving https://github.com/CIRCSE and their many efforts related to the LiLa project (@passarom). If I'm right, they also cover more Medieval Latin than @diyclassics's model.