
Fine-tune crosslingual model for language detection

Open artitw opened this issue 4 years ago • 29 comments

Two approaches to try:

  1. Use crosslingual embeddings as input to MLP or tree-based model in transfer learning fashion
  2. Fine-tune crosslingual translator with softmax output
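
For the first approach, a minimal sketch with scikit-learn, where the embeddings would in practice come from something like `t2t.Handler(texts).vectorize()` — here random clusters stand in for real embeddings, and the shapes and labels are illustrative:

```python
# Sketch of approach 1: crosslingual embeddings -> MLP classifier.
# The stand-in embeddings below replace real vectorize() outputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Two well-separated clusters as stand-ins for two languages.
X = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(5, 1, (20, 8))])
y = ["en"] * 20 + ["fr"] * 20

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
pred = clf.predict(rng.normal(5, 1, (1, 8)))  # a point near the "fr" cluster
```

The same embeddings could equally feed a tree-based model (e.g. gradient boosting) without changing the pipeline shape.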

artitw avatar Jan 01 '22 19:01 artitw

I'd like to get started on this

Mofetoluwa avatar Jan 03 '22 21:01 Mofetoluwa

Awesome, I've assigned you to this project. Let's keep track of progress here.

artitw avatar Jan 03 '22 22:01 artitw

Alright, sure :)

Mofetoluwa avatar Jan 03 '22 22:01 Mofetoluwa

Hi Artit @artitw

Please I need your help, I’m facing some roadblocks.

I decided to start with the second approach you suggested, which is Fine-tuning the cross-lingual translator with a softmax output. My thought process for this:

  1. Get a language detection dataset. I decided to work with wili_2018 (https://arxiv.org/pdf/1801.07779v1.pdf, https://zenodo.org/record/841984#.Yd1XWRPMK3I). What do you think about it, and do you have any other datasets in mind?
  2. Looking at the Fitter class in the code, I'm not exactly sure how to fine-tune the translator on the identification dataset, since it requires both the source and target languages. I was thinking of writing another module to do the fine-tuning, but I'm not sure if that's necessary. Do you have any ideas on how I can apply a softmax output to the already existing translator?

Also, does this sound like the right track for approach number 2?
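
For reference, the softmax head in approach 2 is just a linear layer over the encoder's pooled output followed by a softmax over language classes. A small numpy sketch, where the pooled encoder output is a random stand-in and `n_langs` is illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
hidden_dim, n_langs = 16, 4                # illustrative sizes
pooled = rng.normal(size=(3, hidden_dim))  # stand-in for pooled encoder output
W = rng.normal(size=(hidden_dim, n_langs)) * 0.1  # classification head weights
b = np.zeros(n_langs)

probs = softmax(pooled @ W + b)  # (batch, n_langs) language probabilities
pred = probs.argmax(axis=1)      # predicted language index per example
```

Fine-tuning would then train `W`, `b` (and optionally the encoder) with cross-entropy against the language labels.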

Mofetoluwa avatar Jan 11 '22 20:01 Mofetoluwa

Very much appreciate the updates on this. The dataset you cite looks appropriate; I suggest filtering for the languages which the pretrained model supports for tokenization.

Your ideas on the second approach seem fine so far. Yes, you are correct that you would have to write another module to finetune with a softmax output. I expect this second approach to be more challenging for this reason. If it helps, consider taking the first approach to get things working and then come back to the second approach to get better performance.

artitw avatar Jan 12 '22 01:01 artitw

Alright then, I'll get started with approach 1

Mofetoluwa avatar Jan 12 '22 11:01 Mofetoluwa

Hi Art @artitw

Here is the link to the notebook: https://colab.research.google.com/drive/1VxRRURRAaXBZFsYsXC5hSTkSc-4TGdOj?usp=sharing

Based on the last discussion:

  • I decided to work with a batch size of 5, but the session took too long and would stop partway through the dataset.
  • So I decided to divide the dataset, create embeddings for each division, and save them to a file.
  • I'm still doing that, as there is still some crashing. But there are currently embeddings for 2200 samples, which were used for training.
  • The models trained are very simple since the size of the data used is small.

I will add more embeddings and retrain the models to see how performance improves.

What do you think about it? Thank you :)
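
The chunk-and-save loop described above can be sketched like this; `embed` stands in for whatever call produces the embeddings (e.g. `t2t.Handler(chunk).vectorize()`), and the file naming is illustrative:

```python
import pickle

def chunks(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_and_save(texts, embed, chunk_size=5, prefix="emb"):
    """Embed each chunk separately and persist it immediately,
    so a crash only loses the chunk in progress."""
    paths = []
    for n, chunk in enumerate(chunks(texts, chunk_size)):
        vectors = embed(chunk)  # e.g. t2t.Handler(chunk).vectorize()
        path = f"{prefix}_{n}.pkl"
        with open(path, "wb") as f:
            pickle.dump(vectors, f)
        paths.append(path)
    return paths
```

The saved chunks can then be loaded and concatenated before training the classifier.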

Mofetoluwa avatar Feb 07 '22 19:02 Mofetoluwa

Hi @Mofetoluwa

Thanks for all the work and the summary. It looks like the MLP model is best performing. Would you be able to add it to the repo? It would be great if language classification is available for use while we continue improving it with finetuning and other methods.

artitw avatar Feb 12 '22 02:02 artitw

Hi @artitw

Alright then :)

So just to clarify, how would we want to use the model on the repo? Aside from pushing the saved model, are we also creating a module for it to be called, e.g. identification.py, like the other functionalities?

Mofetoluwa avatar Feb 13 '22 11:02 Mofetoluwa

It would be awesome to create an Identifier class so that we can do something like t2t.Handler(["test text"]).identify(). Could we give that a try?
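
Something like this skeleton, where the embedder and classifier internals are placeholders for the real components (e.g. vectorize() plus the trained MLP):

```python
class Identifier:
    """Predict the language of each input text.

    `embedder` and `classifier` are stand-ins for the real
    components; only the interface shape is the point here."""

    def __init__(self, texts, embedder, classifier):
        self.texts = list(texts)
        self.embedder = embedder
        self.classifier = classifier

    def identify(self):
        vectors = self.embedder(self.texts)
        return self.classifier(vectors)
```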

artitw avatar Feb 13 '22 20:02 artitw

Alright, I'll add the model to the repo first. In which of the folders should I put it?

Mofetoluwa avatar Feb 14 '22 16:02 Mofetoluwa

Can we store the model somewhere like Google Drive and only download it when the Identifier is used? This approach would follow the existing convention of keeping the core library lightweight.
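
A download-on-first-use helper along those lines might look like this (the URL and cache path are placeholders; stdlib `urllib` handles the fetch):

```python
import os
import urllib.request

def ensure_model(url, cache_path):
    """Download the model file only if it is not already cached locally."""
    if not os.path.exists(cache_path):
        os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, cache_path)
    return cache_path
```

The Identifier would call this lazily in its constructor, so users who never identify languages never download the model.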

artitw avatar Feb 15 '22 02:02 artitw

Alright then :)

Mofetoluwa avatar Feb 15 '22 10:02 Mofetoluwa

Hi @artitw

My sincere apologies that the updates are just coming in; I wanted to do some work on the Identifier class before sending them.

  • A pull request has been made for the code, so you can have a look at it and let me know your thoughts.
  • More embeddings were created (~ 6900) and used to train the MLP model, so there has been some improvement as seen in the notebook: https://colab.research.google.com/drive/1Cq1lnDJMI2-ZZxm1VmCWzvLL78E2UmVy?usp=sharing
  • One limitation of the current model is that it doesn't correctly predict the language for some short text sequences (fewer than 10 tokens), though it correctly identifies the language for longer sequences. Hopefully this can be resolved with the second approach; I'd appreciate your thoughts on this problem too.

Thank you :)

Mofetoluwa avatar Feb 27 '22 16:02 Mofetoluwa

@Mofetoluwa thanks for the updates and the pull request. I added some comments there. With regards to the third point you raise, when I tested the model, it returned "hy" for "hello" and "ja" for "你好!". Is this consistent with your testing as well?

artitw avatar Feb 27 '22 23:02 artitw

Yeah, it's a problem I noticed with most languages.

I believe approach 2 would resolve this? Another thing could be to generate shorter texts for this approach. What do you think?

Mofetoluwa avatar Feb 28 '22 09:02 Mofetoluwa

I think training with shorter texts and approach 2 would address the issue. Another approach is to use 2D embeddings. Currently we use 1D embeddings, which are calculated by averaging the last layer outputs, but we could use the last layer outputs directly as 2D embeddings.

I also just realized, while adding the 2D embeddings option in the latest release, that the last-layer averaging could be improved by removing the padding from the calculation. In other words, I think it might be helpful to re-train the MLP identification model on the latest release.
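
The padding fix amounts to masked mean pooling: averaging only over real-token positions instead of the full padded sequence. A small numpy illustration with made-up hidden states:

```python
import numpy as np

# Last-layer outputs for one sequence: 3 real tokens + 2 padding rows.
hidden = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0],
                   [0.0, 0.0],   # padding
                   [0.0, 0.0]])  # padding
mask = np.array([1, 1, 1, 0, 0], dtype=float)  # 1 = real token

naive_mean = hidden.mean(axis=0)  # padding rows drag the average toward zero
masked_mean = (hidden * mask[:, None]).sum(axis=0) / mask.sum()
```

The shorter the text relative to the padded length, the worse the naive average gets, which is consistent with the short-text errors seen above.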

artitw avatar Mar 06 '22 01:03 artitw

Hi @artitw

Oh that sounds great. So how can the 2D embeddings be obtained? Is it still by using the vectorize() function?

Mofetoluwa avatar Mar 08 '22 12:03 Mofetoluwa

@Mofetoluwa yes, we can do vectorize(output_dimension=2) as specified in the latest version.

Also note that the default 1D output should be improved now compared to the version you used most recently.

artitw avatar Mar 08 '22 14:03 artitw

@artitw Oh alright. So should we do a comparison of both?

Then also... adding shorter texts did not really improve the performance of the model. The F1 score and accuracy dropped to about 0.66.

Mofetoluwa avatar Mar 08 '22 14:03 Mofetoluwa

Yes, a comparison of both would be useful. Thanks so much for checking the shorter texts. It will help to confirm the fix for the way 1D embeddings are calculated.

artitw avatar Mar 08 '22 15:03 artitw

Hi Mofe,

  1. Are we sampling the data so that each class is balanced when training?
  2. Could we update the README so that users could have some documentation to use the Identifier?

artitw avatar May 15 '22 19:05 artitw

Hi Art,

  1. Yes, the dataset is balanced: there are 96 languages with 100 samples each, so we used 9600 samples to train the current model. I believe one or two of the 4 languages I didn't find may still be in the original dataset under another name/code, so I'll look into that.

  2. Alright I'll do that shortly...
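
The per-class balancing in point 1 can be sketched with the stdlib alone (field names are illustrative):

```python
import random
from collections import defaultdict

def balanced_sample(examples, per_class, seed=0):
    """Return up to `per_class` examples per label, shuffled within each class.

    `examples` is a list of (text, label) pairs."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    sampled = []
    for label, items in by_label.items():
        rng.shuffle(items)
        sampled.extend(items[:per_class])
    return sampled
```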

Mofetoluwa avatar May 16 '22 08:05 Mofetoluwa

Could we also add the Identifier in the README's class diagram?

artitw avatar May 30 '22 18:05 artitw

@Mofetoluwa, what do you think about using the TFIDF embeddings to perform the language prediction? I think that might be better than the neural embeddings currently used, as it won't have the length dependency.
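
A sketch of that idea using character n-gram TF-IDF features (scikit-learn for illustration; the tiny training set is obviously a stand-in for the real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["hello there", "good morning", "how are you",
         "bonjour tout le monde", "bonne nuit", "comment allez-vous"]
labels = ["en", "en", "en", "fr", "fr", "fr"]

# Character n-grams capture orthography, so even very short inputs
# produce informative features regardless of sequence length.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
pred = model.predict(["bonsoir"])
```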

artitw avatar Sep 17 '22 20:09 artitw

Hi Art @artitw sure that should work actually... I'll try it out and let you know how it goes. I hope you're doing great :)

Mofetoluwa avatar Sep 26 '22 08:09 Mofetoluwa

great, thanks so much Mofe. Really looking forward to it

artitw avatar Oct 02 '22 01:10 artitw

Hi Mofe, in the latest release I fixed an issue with TFIDF embeddings so that they now output a consistent embedding size. Hope this helps

artitw avatar Nov 19 '22 18:11 artitw

Hi Art,

Alright that's cool :)... I'll work with it and let you know how it goes soon

Mofetoluwa avatar Nov 24 '22 19:11 Mofetoluwa