
How to finetune wmt on your own data

Open tatiana-iazykova opened this issue 1 year ago • 14 comments

Hi!

Recently I stumbled across your repo and the WMT models. They showed pretty good results on my data out of the box (I loaded them via HuggingFace), but I failed to find any information about how to fine-tune them on my own data. The preprocessing step in particular is giving me a lot of trouble.

My data is in .csv format with 2 columns: original and translation.

Thanks in advance!

tatiana-iazykova avatar Jun 29 '22 14:06 tatiana-iazykova

Turn the .csv into raw text files like this. Source file name: {name}.{src_lang}, e.g. train.en:

1+1=?
1+2=?

Translation file name: {name}.{tgt_lang}, e.g. train.de:

2
3

Anyway, it is one sentence per line, and each line in the source text file corresponds to the same line in the target text file. Then run fairseq-preprocess --trainpref {path_to_data/name} --source-lang {src_lang} --target-lang {tgt_lang} --srcdict {path to your source lang vocab.txt} --tgtdict {path to your target lang vocab.txt}. You may want to add other options, or you can omit --srcdict/--tgtdict. If you omit the dict paths, fairseq will build one from your data.
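For a .csv like yours (original and translation columns), a minimal conversion sketch could look like this; the file names and the header-based column access are assumptions, so adjust them to your data:

import csv

src_lang, tgt_lang = "en", "de"  # hypothetical language pair
with open("data.csv", newline="", encoding="utf-8") as f, \
        open(f"train.{src_lang}", "w", encoding="utf-8") as src_out, \
        open(f"train.{tgt_lang}", "w", encoding="utf-8") as tgt_out:
    for row in csv.DictReader(f):
        # one sentence per line, same line number in both files
        src_out.write(row["original"].strip() + "\n")
        tgt_out.write(row["translation"].strip() + "\n")

Before running fairseq-preprocess, also tokenize/apply BPE to these files the same way the pretrained model's data was processed, since fine-tuning has to reuse its vocabulary.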

There is no other way to prepare data in the original fairseq. You can always write your own dataset, however.

See https://github.com/facebookresearch/fairseq/tree/main/examples/translation for further steps. Or go one directory up and find an example that suits your purpose.

gmryu avatar Jun 30 '22 14:06 gmryu

Aren't those scripts intended for training your own model and not for fine-tuning an existing one?

tatiana-iazykova avatar Jun 30 '22 14:06 tatiana-iazykova

There is no difference between training a new model and fine-tuning an existing one. You only need to add --restore-file or --continue-once, as explained here: https://fairseq.readthedocs.io/en/latest/command_line_tools.html#checkpoint
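For example (the architecture, paths, and hyperparameters here are placeholders, and --arch has to match the pretrained checkpoint): fairseq-train data-bin --task translation --arch transformer --restore-file /path/to/pretrained_checkpoint.pt --reset-optimizer --reset-lr-scheduler --reset-dataloader --reset-meters --optimizer adam --lr 5e-5 --criterion label_smoothed_cross_entropy --max-tokens 4096 --save-dir checkpoints_finetune. The --reset-* flags make fairseq load only the model weights from the checkpoint instead of also restoring its optimizer state, learning-rate schedule, dataloader position, and meters.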

gmryu avatar Jun 30 '22 15:06 gmryu

But don't I need the vocabulary files associated with the models? If so, I'd be very grateful (if it's no bother, of course) if you could point me in the direction of where to find them :)

tatiana-iazykova avatar Jun 30 '22 15:06 tatiana-iazykova

Yes, you need the original vocabulary files used by the checkpoint. Does the author share them publicly? If not, there is no way to find them.

I did not quite understand your case. Have you used fairseq-preprocess and fairseq-generate, or are you using the converted HuggingFace FSMT?

gmryu avatar Jun 30 '22 15:06 gmryu

The downloaded .tar has dict.{lang}.txt inside.

gmryu avatar Jun 30 '22 15:06 gmryu

Hello @gmryu, your previous answer helped me! But when I have some new self-defined special tokens in the dataset, how do I preserve them during the fairseq-preprocess procedure? Or, if that is not the right place, when should I add the new special tokens? I was referring to the IWSLT'14 German to English (Transformer) example under https://github.com/facebookresearch/fairseq/tree/main/examples/translation. It seems that the dataset has been tokenized before preprocessing. I am a novice and I have gone through the relevant issues but found no answer. I really appreciate your help.

martianmartina avatar Aug 10 '22 19:08 martianmartina

@martianmartina Hi. In short, you have to fix the tokenized data so that it contains your custom tokens correctly, and add your custom tokens to dict.txt (the --srcdict if they appear in the input, the --tgtdict if they appear in the output), deleting the same number of existing tokens if you need to keep the vocabulary size unchanged.


First of all, texts are tokenized before fairseq-preprocess. fairseq-preprocess does not tokenize and always requires the data to be whitespace-split already.

So, for example, if you want [fact] and [question] as your two custom tokens, and your raw input texts are:

[fact]1+1=2. [question]1+1?
[fact]1+2=3. [question]1+2?

After tokenization (subword, sentencepiece, etc.), they may become:

[ fact ] 1 + 1 = 2 . [ question ] 1 + 1 ?
[ fact ] 1 + 2 = 3 . [ question ] 1 + 2 ?

Then to fairseq, [ fact ] is now 3 tokens, which is not what you want.

In other words, you have to write a Python script that reads the tokenized files and fixes the wrongly tokenized custom tokens. In this case, that means replacing "[ fact ]" with "[fact]" and doing the same for "[question]" (for this small example, a text editor will do the job):

[fact] 1 + 1 = 2 . [question] 1 + 1 ?
[fact] 1 + 2 = 3 . [question] 1 + 2 ?

Then to fairseq, [fact] is one token now. But if [fact] is not listed in dict.{source_lang}.txt, [fact] becomes <unk>. Unfortunately, all you can do then is swap the least frequent tokens (the last lines in dict.txt) for your custom tokens.
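A minimal sketch of that fix-up, following the [fact]/[question] example (the file names here are placeholders, adjust them to your data):

fixes = {
    "[ fact ]": "[fact]",
    "[ question ]": "[question]",
}

# Re-join the custom tokens the tokenizer split apart, in every tokenized file.
for path in ["train.en", "valid.en", "test.en"]:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for broken, fixed in fixes.items():
        text = text.replace(broken, fixed)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Swap the least frequent entries of the pretrained dict for the custom tokens,
# keeping the vocabulary size (and therefore the embedding matrix) unchanged.
with open("dict.en.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
for i, tok in enumerate(["[fact]", "[question]"], start=1):
    count = lines[-i].split()[1]  # keep the original count column
    lines[-i] = f"{tok} {count}"
with open("dict.en.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")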

This is what you can do if you want to use custom tokens with a pretrained model.

gmryu avatar Aug 11 '22 07:08 gmryu

@gmryu Thank you so much! Your answer cleared up most of my problems. I still wonder if there is a way to tell the tokenizer not to split my special tokens. My previous attempt was to modify the tokenization files of fairseq under fairseq/data/encoders/, as issue #1867 pointed out (although I am training from scratch), but it didn't work because, as you mentioned, fairseq-preprocess doesn't tokenize at all. So under which commands will fairseq call those files under the encoders directory? Or how can I perform the tokenization in https://github.com/facebookresearch/fairseq/blob/main/examples/translation/prepare-iwslt14.sh on my own dataset? You could just give me some directions if it is too detailed or can be searched online easily. Thank you in advance.

martianmartina avatar Aug 12 '22 00:08 martianmartina

@gmryu Also, if I train from scratch instead of fine-tuning a pre-trained model, would there be anything I need to take care of instead of replacing dictionary entries with 0 or low counts? Previously I hard-coded all my self-defined special tokens as the extra_special_tokens parameter of the Dictionary class in https://github.com/facebookresearch/fairseq/blob/b5a039c292facba9c73f59ff34621ec131d82341/fairseq/data/dictionary.py.

martianmartina avatar Aug 12 '22 01:08 martianmartina

@martianmartina fairseq-preprocess is actually {fairseq repository}/fairseq_cli/preprocess.py; if you search for encode_fn there, you find nothing. Yet if you search for encode_fn and decode_fn in the whole fairseq_cli, you find:

  • fairseq-generate has only decode_fn
  • fairseq-interactive has both encode_fn and decode_fn

Those are where fairseq/data/encoders are used.

That prepare-iwslt14.sh is, how should I put it, not a feature of fairseq. The same can be said about the files inside fairseq/data/encoders/: while there are .py files in there, they all import and use libraries that come from GitHub repositories other than fairseq.

So personally, I have only used sentencepiece and I do not know how to change the other tokenizers. It might be better to search Google for those GitHub repos.
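For the sentencepiece case specifically, one way to keep custom tokens whole is its user_defined_symbols option. A minimal sketch (the input file and vocab size are placeholders):

import sentencepiece as spm

# Train a sentencepiece model; user_defined_symbols keeps these tokens unsplit.
spm.SentencePieceTrainer.train(
    input="train.raw",
    model_prefix="spm_custom",
    vocab_size=8000,
    user_defined_symbols="[fact],[question]",
)

sp = spm.SentencePieceProcessor()
sp.load("spm_custom.model")
print(sp.encode_as_pieces("[fact]1+1=2. [question]1+1?"))
# "[fact]" and "[question]" come out as single pieces.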


Finally about training from scratch, you need

  • a satisfying vocabulary (dict.xx.txt)
  • texts whose tokens are mostly inside that vocabulary
  • to run the code correctly

A good vocabulary has enough tokens, with the frequent tokens written at the top, but actually your satisfaction is all that matters. The count column has no interaction with fairseq or the models. extra_special_tokens are just normal tokens: simply put those tokens inside dict.xx.txt and it is done. It does not matter whether they go at the top, on the last line, or in the middle.
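For reference, dict.xx.txt is just one token and a count per line, whitespace-separated, something like (tokens and counts invented here for illustration):

the 1061396
, 997768
. 896552
[fact] 1
[question] 1

(<s>, <pad>, </s> and <unk> are added by fairseq's Dictionary automatically and are not listed in the file.)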

You mostly get dict.xx.txt from your data or from someone else's project. Then you make your texts correct. I guess I have to say it again: "you make your texts correct." That is the most important and most exhausting part.

gmryu avatar Aug 12 '22 11:08 gmryu

@gmryu Thanks a ton! Nevertheless, I found that fairseq-preprocess will generate dict.xx.txt if you don't specify existing ones. In that case, I guess I don't need to find one, right? And the one generated by fairseq-preprocess should contain all the tokens inside the texts.

martianmartina avatar Aug 12 '22 12:08 martianmartina

@gmryu And may I ask what we need the [--tokenizer {moses,nltk,space}] and [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}] parameters for in fairseq-preprocess?

martianmartina avatar Aug 12 '22 13:08 martianmartina

@martianmartina Yes, fairseq-preprocess creates a dict for you. There are --thresholdtgt, --thresholdsrc, --nwordstgt, --nwordssrc, and --padding-factor options that affect the created dict. See https://fairseq.readthedocs.io/en/latest/command_line_tools.html#Preprocessing

A small disadvantage is that this dict is not "trained"; it is literally decided by how the texts are tokenized. If the data is "I like banana-pancake .", then the vocabulary is literally those 4 tokens: I, like, banana-pancake, and the full stop. It does not create banana, -, or cake. Thus it is the tokenizer that decides the vocabulary. (Also, if you use a public/well-known vocab, you can make a comparison.)
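In other words, the created dict is essentially just a whitespace-token frequency count over the training text, roughly in the spirit of this sketch (not the actual preprocess.py code; file names are placeholders):

from collections import Counter

counts = Counter()
with open("train.en", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())  # whatever tokens the tokenizer left, verbatim

# dict.en.txt: most frequent tokens first, "token count" per line
with open("dict.en.txt", "w", encoding="utf-8") as f:
    for token, count in counts.most_common():
        f.write(f"{token} {count}\n")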


So, as I wrote before (and I recommend you look back at it):

  • fairseq-generate has only decode_fn
  • fairseq-interactive has both encode_fn and decode_fn

Those are where fairseq/data/encoders are used.

For fairseq-preprocess, those --tokenizer and --bpe arguments are ignored; fairseq-train ignores them too. Those arguments appear because they are registered in fairseq/data/encoders/__init__.py, which gets imported through the chain starting from fairseq/__init__.py.

There are a lot of ignored arguments that get registered like this. You can declare --lr-scheduler for preprocessing, which makes no sense. If you want to know which arguments are actually valid, searching for args. in the corresponding .py gives you a starting clue.

gmryu avatar Aug 12 '22 13:08 gmryu