Finetuning Trankit for POS on a new language
Hi, I've looked through the documentation on building a customized pipeline and the NER example for German. However, I don't understand whether this is finetuning, that is, starting from an already pretrained Trankit model and adding a new language, or whether it is training from scratch.
In particular, I am interested in adding a new language, say, Thai for Part-of-Speech tagging. My plan is to download the th_pud-ud-test.conllu
from https://github.com/UniversalDependencies/UD_Thai-PUD, and finetune the existing model to accept this language as well.
1 - I am not sure where to start from.
2 - Will I keep the quality of POS for the other languages?
3 - How can I make this retrained/finetuned model with the Thai language available in "Auto Mode for multilingual pipelines"?
Thank you!
Hey Ph.D. Oleg Polivin, you can decide whether you freeze the pretrained xlm-r-base or xlm-r-large model and fine-tune only the adapters for your new tasks. The pretrained model and the other adapters aren't altered. This Thai UD corpus has no lemma data, so we could simply add some token copies as lemmas. For example, here is the second sentence of the corpus with the short tokens also used as lemmas in the third column:
# sent_id = n01001013
# text = สำหรับผู้ที่ติดตามการเปลี่ยนผ่านโซเชียลมีเดียในแคปิตอลฮิล เรื่องนี้จะแตกต่างไปเล็กน้อย
# text_en = For those who follow social media transitions on Capitol Hill, this will be a little different.
1 สำหรับ _ ADP IN _ 2 case _ SpaceAfter=No
2 ผู้ ผู้ NOUN NN _ 13 obl _ SpaceAfter=No
3 ที่ ที่ DET WDT _ 4 nsubj _ SpaceAfter=No
4 ติดตาม _ VERB VV _ 2 acl:relcl _ SpaceAfter=No
5 การเปลี่ยนผ่าน _ VERB VV _ 4 obj _ SpaceAfter=No
6 โซเชียล _ ADJ JJ _ 7 amod _ Proper=True|SpaceAfter=No
7 มีเดีย _ NOUN NN _ 5 obj _ SpaceAfter=No
8 ใน ใน ADP IN _ 9 case _ SpaceAfter=No
9 แคปิตอลฮิล _ PROPN NNP _ 5 obl _ _
10 เรื่อง _ NOUN NN _ 13 nsubj _ SpaceAfter=No
11 นี้ นี้ DET DT _ 10 det _ SpaceAfter=No
12 จะ จะ AUX MD _ 13 aux _ SpaceAfter=No
13 แตกต่าง _ VERB VV _ 0 root _ SpaceAfter=No
14 ไป ไป PART RP _ 13 compound:prt _ SpaceAfter=No
15 เล็กน้อย _ ADV RB _ 13 advmod _ _
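If you want to fill in the missing lemmas automatically, a minimal sketch like the following could do it (my addition, not something trankit provides; it copies the FORM column into every empty LEMMA column, whereas above only the short tokens were filled, and the file names are just examples):

# Sketch: fill empty LEMMA columns of a CoNLL-U file by copying the token form.
def copy_forms_as_lemmas(in_path, out_path):
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Keep comment lines and sentence-separating blank lines unchanged.
            if line.startswith("#") or line.strip() == "":
                dst.write(line)
                continue
            cols = line.rstrip("\n").split("\t")
            # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            if len(cols) == 10 and cols[2] == "_":
                cols[2] = cols[1]  # use the token itself as its lemma
            dst.write("\t".join(cols) + "\n")

copy_forms_as_lemmas("th_pud-ud-test.conllu", "th_pud-ud-test.with-lemmas.conllu")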
Each of the train and dev files should contain at least some lemma information in order to train the whole "customized-mwt" pipeline from https://trankit.readthedocs.io/en/latest/training.html. You can run this Python 3 script to split the Thai corpus into a train and a dev part:
# Split th_pud-ud-test.conllu into a dev part (first 100 sentences)
# and a train part (the rest).
with open("th_pud-ud-test.conllu", "r", encoding="utf-8") as f:
    lines = f.readlines()

sentence_paragraphs = []
conllu_sentence = ""
for line in lines:
    if line == "\n":
        # A blank line ends a CoNLL-U sentence block.
        sentence_paragraphs.append(conllu_sentence)
        conllu_sentence = ""
    else:
        conllu_sentence += line

with open("th_dev.conllu", "w", encoding="utf-8") as thai_dev, \
     open("th_train.conllu", "w", encoding="utf-8") as thai_train:
    for number, paragraph in enumerate(sentence_paragraphs):
        if number < 100:
            thai_dev.write(paragraph)
            thai_dev.write("\n")
        else:
            thai_train.write(paragraph)
            thai_train.write("\n")
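If you like, you can sanity-check the split by counting the sentence headers in each file (a small optional helper, not part of the original steps):

# Optional sanity check: count sentences per split via their "# sent_id" headers.
for path in ("th_dev.conllu", "th_train.conllu"):
    with open(path, encoding="utf-8") as fh:
        count = sum(1 for line in fh if line.startswith("# sent_id"))
    print(path, "contains", count, "sentences")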
Then use the Perl script from https://github.com/UniversalDependencies/tools to also get the raw text:
perl conllu_to_text.pl --lang th < th_train.conllu > th_train.txt
perl conllu_to_text.pl --lang th < th_dev.conllu > th_dev.txt
Then insert the path of the trainings data folder into the configuration like in this notebook: https://colab.research.google.com/drive/1wqkDPx4LGBE8qP9dU8Z3oNHBIscXPrAZ?usp=sharing
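For orientation, a hedged sketch of what the POS/dependency training call could look like with the TPipeline interface from the training documentation linked above (paths, the save directory and the 'customized-mwt' category are example values, and the exact keys may differ between trankit versions):

import trankit

# Sketch of a posdep training run, following the trankit training docs.
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt',              # pipeline category from the docs
        'task': 'posdep',                          # POS tagging + dependency parsing
        'save_dir': './save_dir',                  # where the trained adapters are stored
        'train_conllu_fpath': './th_train.conllu',
        'dev_conllu_fpath': './th_dev.conllu',
    }
)
trainer.train()

# Once every required task of the category is trained, verify and load it.
# (Depending on the trankit version, this call may take additional arguments.)
trankit.verify_customized_pipeline(category='customized-mwt', save_dir='./save_dir')
p = trankit.Pipeline(lang='customized-mwt', cache_dir='./save_dir')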
Greetings from the translation space https://bachstelze.gitlab.io/multisource/
@Bachstelze Thanks a lot, this is amazing!