
Finetuning Trankit for POS on a new language

Open olegpolivin opened this issue 3 years ago • 2 comments

Hi, I've looked through the documentation on building a customized pipeline and through the NER example for German. However, I do not understand whether this is fine-tuning, that is, starting from an already pretrained Trankit model and adding a new language, or training from scratch.

In particular, I am interested in adding a new language, say, Thai for part-of-speech tagging. My plan is to download the th_pud-ud-test.conllu file from https://github.com/UniversalDependencies/UD_Thai-PUD and finetune the existing model to accept this language as well.

1 - I am not sure where to start from.
2 - Will I keep the quality of POS for the other languages?
3 - How do I make this retrained/finetuned model with the Thai language available in "Auto Mode for multilingual pipelines"?

Thank you!

olegpolivin avatar Sep 08 '21 15:09 olegpolivin

Hey Ph.D. Oleg Polivin, you can decide whether to freeze the pretrained xlm-r-base or xlm-r-large model and fine-tune only the adapters for your new tasks; the pretrained model and the other adapters aren't altered. Note that this Thai UD corpus has no lemma data. We can work around that by simply copying some tokens into the lemma column (a small script to automate this follows the example). For example, here is the second sentence of the corpus with the short tokens also used as lemmas in the third column:

# sent_id = n01001013
# text = สำหรับผู้ที่ติดตามการเปลี่ยนผ่านโซเชียลมีเดียในแคปิตอลฮิล เรื่องนี้จะแตกต่างไปเล็กน้อย
# text_en = For those who follow social media transitions on Capitol Hill, this will be a little different.
1	สำหรับ	_	ADP	IN	_	2	case	_	SpaceAfter=No
2	ผู้	ผู้	NOUN	NN	_	13	obl	_	SpaceAfter=No
3	ที่	ที่	DET	WDT	_	4	nsubj	_	SpaceAfter=No
4	ติดตาม	_	VERB	VV	_	2	acl:relcl	_	SpaceAfter=No
5	การเปลี่ยนผ่าน	_	VERB	VV	_	4	obj	_	SpaceAfter=No
6	โซเชียล	_	ADJ	JJ	_	7	amod	_	Proper=True|SpaceAfter=No
7	มีเดีย	_	NOUN	NN	_	5	obj	_	SpaceAfter=No
8	ใน	ใน	ADP	IN	_	9	case	_	SpaceAfter=No
9	แคปิตอลฮิล	_	PROPN	NNP	_	5	obl	_	_
10	เรื่อง	_	NOUN	NN	_	13	nsubj	_	SpaceAfter=No
11	นี้	นี้	DET	DT	_	10	det	_	SpaceAfter=No
12	จะ	จะ	AUX	MD	_	13	aux	_	SpaceAfter=No
13	แตกต่าง	_	VERB	VV	_	0	root	_	SpaceAfter=No
14	ไป	ไป	PART	RP	_	13	compound:prt	_	SpaceAfter=No
15	เล็กน้อย	_	ADV	RB	_	13	advmod	_	_
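To apply this to the whole corpus, something like the following sketch could fill in the missing lemmas automatically. The length cutoff and the output file name are assumptions for illustration, not something prescribed by Trankit:

# Hypothetical helper: copy the surface form into the empty lemma column
# for short tokens, so the lemmatizer has something to train on.
MAX_LEMMA_COPY_LEN = 3  # assumption: only copy short tokens; tune as needed

with open("th_pud-ud-test.conllu", encoding="utf-8") as src, \
     open("th_pud-ud-test.lemmas.conllu", "w", encoding="utf-8") as dst:
    for line in src:
        cols = line.rstrip("\n").split("\t")
        # CoNLL-U token lines have 10 tab-separated columns:
        # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        if (len(cols) == 10 and cols[0].isdigit()
                and cols[2] == "_" and len(cols[1]) <= MAX_LEMMA_COPY_LEN):
            cols[2] = cols[1]  # use the token itself as its lemma
            dst.write("\t".join(cols) + "\n")
        else:
            dst.write(line)  # comments, blank lines, already-lemmatized tokens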

Each of the train and dev files should contain at least some lemma information in order to train the whole "customized-mwt" pipeline from https://trankit.readthedocs.io/en/latest/training.html. You can run this python3 script to split the Thai corpus into a train and a dev part:

# Split the single Thai PUD file into a dev part (first 100 sentences)
# and a train part (the rest). Sentences in CoNLL-U files are separated
# by blank lines.
with open("th_pud-ud-test.conllu", encoding="utf-8") as f:
    lines = f.readlines()

sentence_paragraphs = []
conllu_sentence = ""
for line in lines:
    if line == "\n":
        if conllu_sentence:  # guard against consecutive blank lines
            sentence_paragraphs.append(conllu_sentence)
        conllu_sentence = ""
    else:
        conllu_sentence += line
if conllu_sentence:  # keep the last sentence if there is no trailing blank line
    sentence_paragraphs.append(conllu_sentence)

with open("th_train.conllu", "w", encoding="utf-8") as thai_train, \
     open("th_dev.conllu", "w", encoding="utf-8") as thai_dev:
    for number, paragraph in enumerate(sentence_paragraphs):
        if number < 100:
            thai_dev.write(paragraph + "\n")
        else:
            thai_train.write(paragraph + "\n")
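This writes the first 100 sentences to th_dev.conllu and the remaining ones to th_train.conllu (the PUD treebanks contain 1,000 sentences each, so this is roughly a 10/90 split).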

Then use the perl script from https://github.com/UniversalDependencies/tools to also produce the raw text files:

perl conllu_to_text.pl --lang th < th_train.conllu > th_train.txt
perl conllu_to_text.pl --lang th < th_dev.conllu > th_dev.txt

Then insert the paths of the training data into the configuration like in this notebook: https://colab.research.google.com/drive/1wqkDPx4LGBE8qP9dU8Z3oNHBIscXPrAZ?usp=sharing
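For reference, a training call following the customized-pipeline docs looks roughly like this; the save_dir and the file paths are assumptions matching the files created above, and one trainer is run per task:

import trankit

# A sketch based on the customized-pipeline training docs; paths and
# save_dir are placeholders matching the files created above.
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt',   # pipeline category
        'task': 'posdep',               # repeat for 'tokenize', 'mwt', 'lemmatize'
        'save_dir': './save_dir',       # directory for the trained model
        'train_conllu_fpath': './th_train.conllu',
        'dev_conllu_fpath': './th_dev.conllu',
    }
)
trainer.train()

The tokenizer task additionally takes the raw text files produced by conllu_to_text.pl ('train_txt_fpath' / 'dev_txt_fpath'). After all tasks are trained, the docs describe a final trankit.verify_customized_pipeline(category=..., save_dir=...) step that makes the customized pipeline loadable like any other.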

Greetings from the translation space https://bachstelze.gitlab.io/multisource/

Bachstelze avatar May 31 '22 05:05 Bachstelze

@Bachstelze Thanks a lot, this is amazing!

olegpolivin avatar Jul 28 '22 09:07 olegpolivin