
Changing the language

Open flovera1 opened this issue 7 years ago • 4 comments

Hi, I wanted to know whether other languages follow the same pattern. Here you used GloVe vectors for the English data, but I'm trying to do something similar to this project with Dutch as the source language. How can I get GloVe vectors for Dutch? Thank you.

flovera1 avatar Mar 07 '17 09:03 flovera1

Hi, the pre-trained word embeddings don't necessarily have to be GloVe vectors; word2vec vectors work as well. I found some pre-trained Dutch vectors here: https://github.com/clips/dutchembeddings. For these to work with punctuator2, you need to remove the header (the first line) from the txt files. Best, Ottokar
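
The header removal mentioned above could be done along these lines. This is only a minimal sketch, not code from the punctuator2 repository: the function name `strip_word2vec_header` and the file names in the usage note are hypothetical, and it assumes the header is the usual word2vec text header of two integers (vocabulary size and vector dimension).

```python
def strip_word2vec_header(src_path, dst_path):
    """Copy src_path to dst_path without the word2vec header line.

    Assumption: a header consists of exactly two integers
    ("<vocab_size> <dimension>"). If the first line does not look
    like that, it is treated as a normal vector line and kept.
    Returns True if a header was found and dropped.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        first = src.readline()
        parts = first.split()
        is_header = len(parts) == 2 and all(p.isdigit() for p in parts)
        if not is_header:
            dst.write(first)
        # Copy the remaining vector lines unchanged.
        for line in src:
            dst.write(line)
    return is_header
```

For example, `strip_word2vec_header("dutch_vectors.txt", "dutch_vectors_noheader.txt")` would write a header-free copy, where both file names are placeholders for your own paths.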

ottokart avatar Mar 07 '17 15:03 ottokart

Hi, Just to make sure that I'm following the comments:

  1. If we want to use pre-trained word embeddings for our language (say French or Spanish) because we have a small training data set, we can follow the advice above regarding GloVe or word2vec, and maybe even use https://gist.github.com/ottokart/673d82402ad44e69df85 to make a We.pcl file. Is that correct?

  2. However, based on code I'm reading here: https://github.com/ottokart/punctuator2/blob/master/models.py#L128 it is not strictly required to use a pre-trained word-embedding IF you have lots of training data for your language. Is that also correct?

Thank you very much, and great work!

ruohoruotsi avatar Oct 30 '17 22:10 ruohoruotsi

  1. To use pre-trained embeddings, set PRETRAINED_EMBEDDINGS_PATH in data.py to the path of the embeddings file in text format (data.py will build the required We.pcl file from the text file). This text file should be limited to a reasonable number of the top words (e.g. 100 000). I created a new gist for creating this text file from a binary gzipped word2vec file: https://gist.github.com/ottokart/4031dfb471ad5c11d97ad72cbc01b934
  2. That's correct - using pre-trained word embeddings is completely optional (they helped me on the TED Talks dataset, but on larger datasets I generally don't use them).
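
If the embeddings are already in text format, limiting the file to the top words could look roughly like the sketch below. It is not the linked gist and not part of punctuator2: the function name `keep_top_words` is hypothetical, and it assumes a header-free file with one `word v1 v2 ... vd` entry per line, sorted with the most frequent words first (common for word2vec output, but worth checking for your file).

```python
def keep_top_words(src_path, dst_path, max_words=100000):
    """Write at most max_words vector lines from src_path to dst_path.

    Assumptions: no header line, whitespace-separated fields, and
    entries ordered from most to least frequent word.
    Returns (number_of_lines_kept, vector_dimension).
    """
    kept = 0
    dim = None
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if kept >= max_words:
                break
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            if dim is None:
                dim = len(parts) - 1  # dimension taken from the first entry
            elif len(parts) - 1 != dim:
                continue  # skip entries whose dimension disagrees
            dst.write(line if line.endswith("\n") else line + "\n")
            kept += 1
    return kept, dim
```

The resulting file is what PRETRAINED_EMBEDDINGS_PATH would then point to; whether data.py accepts exactly this layout is an assumption to verify against the repository.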

ottokart avatar Nov 02 '17 14:11 ottokart

Hi, I found some word2vec files for my language, and I also found your two scripts: one that converts .bin (.vec) to embeddings.txt, https://gist.github.com/ottokart/4031dfb471ad5c11d97ad72cbc01b934, and one that converts .bin (.vec) to .pcl, https://gist.github.com/ottokart/673d82402ad44e69df85. I tried the second script and got "myEmbedings.pcl" as a result, but it didn't work with punctuator.

How can I adapt my word2vec file for punctuator, and what is the difference between the two scripts?

Thank you in advance

MeteorBurn avatar Jul 16 '18 09:07 MeteorBurn