Ukrainian language support in Flair

Open alanakbik opened this issue 3 years ago • 7 comments

This issue tracks the progress of adding support for the Ukrainian language from lang-uk to Flair. We would like to add:

  • [x] Ukrainian Flair embeddings trained by @dchaplinsky and available here: forward and backward. Should be made loadable with embeddings = FlairEmbeddings('uk-forward') and embeddings = FlairEmbeddings('uk-backward')
  • [x] Ukrainian NER by @dchaplinsky, available here. Should be made loadable with tagger = SequenceTagger.load('ner-ukrainian')
  • [x] Ukrainian part-of-speech tagger by @dchaplinsky, available here. Should be made loadable with tagger = SequenceTagger.load('pos-ukrainian')
  • [x] Ukrainian NER dataset described here. Loadable as corpus = NER_UKRAINIAN(). Should be integrated only once version 2.0 is complete.
  • [x] Ukrainian Universal Dependency Treebank, loadable as corpus = UD_UKRAINIAN().
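Once these items are in place, usage could look roughly like the sketch below. It only uses the identifiers named in the checklist above and assumes an installed Flair release that resolves them. The helper names (`UKRAINIAN_MODELS`, `tag_ukrainian`, `stacked_ukrainian_embeddings`) are hypothetical and not part of Flair.

```python
# Model identifiers as proposed in the checklist above. Whether each one
# resolves depends on the installed Flair version (assumption).
UKRAINIAN_MODELS = {
    "embeddings_forward": "uk-forward",
    "embeddings_backward": "uk-backward",
    "ner": "ner-ukrainian",
    "pos": "pos-ukrainian",
}


def tag_ukrainian(text, task="ner"):
    """Tag `text` with the Ukrainian NER or POS tagger and return the Sentence."""
    # Imported lazily so the identifiers above can be inspected without Flair.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    # The model is downloaded on first use and cached afterwards.
    tagger = SequenceTagger.load(UKRAINIAN_MODELS[task])
    sentence = Sentence(text)
    tagger.predict(sentence)
    return sentence


def stacked_ukrainian_embeddings():
    """Stack the forward and backward Ukrainian Flair embeddings."""
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings

    return StackedEmbeddings([
        FlairEmbeddings(UKRAINIAN_MODELS["embeddings_forward"]),
        FlairEmbeddings(UKRAINIAN_MODELS["embeddings_backward"]),
    ])
```

Calling `tag_ukrainian("...", task="pos")` would switch to the part-of-speech tagger. The stacked embeddings can be passed to a new `SequenceTagger` when training a downstream model.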

alanakbik avatar Nov 08 '22 20:11 alanakbik

This is the code for the NER corpus I've used: https://github.com/lang-uk/flair-ner/blob/main/train_base.py#L32

and the code for the POS corpus: https://github.com/lang-uk/flair-pos/blob/main/train_grid.py#L21

I'll check whether I have a fixed split for NER hosted somewhere else.

dchaplinsky avatar Nov 08 '22 20:11 dchaplinsky

Really cool idea!

I had to do a lot of manual preprocessing steps to get NER working when evaluating the ELECTRA model:

https://github.com/stefan-it/ukrainian-electra/blob/main/download_prepare_data_ner.sh

stefan-it avatar Dec 06 '22 22:12 stefan-it

Oh, @stefan-it, thanks for reminding me. I totally forgot about the fixed split.

On a separate topic: would you like to try training ELECTRA on better-quality Ukrainian texts?

dchaplinsky avatar Dec 07 '22 09:12 dchaplinsky

Hey @dchaplinsky , I currently have access to TPUs, so if you have texts available I would love to pretrain another model :hugs:

stefan-it avatar Dec 07 '22 10:12 stefan-it

Yes I do! Could you contact me at chaplinsky[dot]dmitry on gmail?

dchaplinsky avatar Dec 07 '22 11:12 dchaplinsky

Hi @alanakbik and @stefan-it

I've just uploaded two bigger models for the Ukrainian language: https://huggingface.co/lang-uk/flair-uk-forward-large https://huggingface.co/lang-uk/flair-uk-backward-large

They have hidden_size=2048 (in contrast to 1024 for the original ones) and were trained on my data plus data from Stefan (54 GB in total).

I've also trained a downstream NER model on them and got a nice 1.5% improvement over the previous one. I will publish it shortly.

dchaplinsky avatar May 08 '23 17:05 dchaplinsky

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 17 '23 01:09 stale[bot]