
How to generate the mapping matrices for the ELMo of my own language

Open • scofield7419 opened this issue 6 years ago • 6 comments

I've trained ELMo weights for several other languages (e.g., Finnish, Chinese; I'm considering contributing them to enrich your repo), and now I want to align them (including each LSTM layer) to the English space, just as you did.

But it seems you did not release the code for generating such alignment matrices (like the ones you provide).

P.S.: Note that you only released the code for generating the anchors, which I believe has nothing to do with the alignment matrices. If I've misunderstood the approach, please give me some hints and correct me.

So, may I have your prompt reply concerning this issue? Thanks a lot.

scofield7419 avatar Sep 27 '19 08:09 scofield7419

Hi @scofield7419 That's great! I'm sure people will find the models and alignments for more languages useful.

The supervised alignment computation was done with the MUSE repository. Their repo is not installable with pip, so I think the best way is to follow their instructions and run it directly. I can create a short bash script if it helps.

Use their provided command line, for example:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

Let me know if that works. When you have the alignment, you are welcome to submit a PR with the new models and matrices. Please also report the word translation accuracies from the MUSE script to make sure that the alignment worked.
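Something along these lines should work as a starting point (just a sketch, assuming you clone MUSE into the current directory and download the fastText Wikipedia vectors into its data/ folder; paths and file names are placeholders to adjust):

```bash
#!/usr/bin/env bash
# Sketch: set up MUSE and run their supervised alignment example.
# Paths and file names are placeholders -- adjust them to your setup.
set -e

git clone https://github.com/facebookresearch/MUSE.git
cd MUSE

# Download the evaluation datasets / bilingual dictionaries (per their README).
(cd data && bash get_evaluation.sh)

# The fastText wiki.XX.vec files also need to be placed in data/ first
# (see the MUSE README for the download links).

python supervised.py \
    --src_lang en --tgt_lang es \
    --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec \
    --n_refinement 5 --dico_train default
```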

TalSchuster avatar Sep 27 '19 15:09 TalSchuster

Hi @TalSchuster, thank you for your reply.

As for MUSE, I'm actually quite familiar with it (I use it frequently) ^_^

If I guess correctly, the first thing I should do to align one language (say Finnish) to English is to use get_anchors.py to generate the avg_embeds_{%i}.txt files (the averaged embeddings of each anchor word over the whole vocabulary) for the i-th LSTM layer of ELMo, for both English and Finnish. Then, with MUSE, I align the corresponding anchor embeddings of the i-th ELMo layer for English and Finnish, producing a best_mapping.pth for each of the layers [0, 1, 2], one by one.

Is all of the above correct? Thanks again for everything.

scofield7419 avatar Sep 27 '19 18:09 scofield7419

Yes, that sounds correct. I've uploaded the anchors for the provided English model, which will save you the time of extracting the English anchors. There's a link now in the main README.
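Roughly, the per-layer step could look like the sketch below. It makes several assumptions: the anchor files are named avg_embeds_{0,1,2}.txt and are in the word2vec text format MUSE reads (rename to .vec if needed), you map the Finnish anchors onto the English ones, the ELMo dimension is 1024, and MUSE's default fi-en training dictionary is used. The directory and experiment names are placeholders, so adjust the flags to your setup:

```bash
#!/usr/bin/env bash
# Sketch: learn one alignment matrix per ELMo layer with MUSE.
# File names, directories, and experiment names below are placeholders.
set -e

for layer in 0 1 2; do
    python supervised.py \
        --src_lang fi --tgt_lang en \
        --src_emb fi_anchors/avg_embeds_${layer}.txt \
        --tgt_emb en_anchors/avg_embeds_${layer}.txt \
        --emb_dim 1024 \
        --n_refinement 5 \
        --dico_train default \
        --exp_name elmo_fi_en --exp_id layer${layer}
done

# Each run should write a best_mapping.pth (the alignment matrix for that
# layer) under MUSE's dump directory, e.g. dumped/elmo_fi_en/layer<i>/.
```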

TalSchuster avatar Sep 27 '19 18:09 TalSchuster

BTW, there's another thing to ask:

If I want to generate a truly multilingual ELMo embedding (I mean a real one, like multilingual BERT, not one obtained through alignment), can I just mix a sufficient number of sentences from different languages (say, 10 languages) as the training data for ELMo?

Specifically, I could prepare a considerable amount of text for each language (say, 50M sentences per language). Would this work for producing a real multilingual ELMo, and would that much training data per language be sufficient?

scofield7419 avatar Sep 27 '19 18:09 scofield7419

For joint training, you can check this paper. In short: on average it can provide better results, but the effectiveness varies across languages. Still, it is usually worth learning and applying an alignment after the joint training, since even if you train jointly, there is no strong constraint that makes the cross-lingual representations aligned.

TalSchuster avatar Sep 27 '19 19:09 TalSchuster

Thank you for your response! : )

scofield7419 avatar Sep 28 '19 01:09 scofield7419