topicvec
If I want to try it on another language, how do I train 25000-180000-500-BLK-8.0.vec.npy? and..
Hi all,
If I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy and get top1grams-wiki.txt? For example, for Chinese, I have a pre-trained w2v model of the Chinese Wikipedia. Can I get these files from this pre-trained model? Thanks!
top1grams-wiki.txt is generated by a Perl script, https://github.com/askerlee/topicvec/blob/master/psdvec/gramcount.pl. You can generate it using the Chinese Wikipedia text as input. gramcount.pl also generates top2grams-wiki.txt (two separate runs are needed, one for top1grams* and one for top2grams*). Then you use https://github.com/askerlee/topicvec/blob/master/psdvec/factorize.py to generate 25000-180000-500-BLK-8.0.vec, with both top1grams* and top2grams* as input.
You can find an example in https://github.com/askerlee/topicvec/blob/master/psdvec/PSDVec.pdf.
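To make the two-pass workflow concrete, here is a minimal driver sketch in Python. The corpus filename and the argument lists passed to gramcount.pl and factorize.py are placeholders, not the scripts' actual interfaces; check the script headers and PSDVec.pdf for the real options.

```python
# Minimal driver sketch for the pipeline above, assuming a cleaned Chinese
# Wikipedia dump in plain text. The argument lists below are placeholders --
# see gramcount.pl, factorize.py and PSDVec.pdf for the actual options.
import subprocess

CORPUS = "zhwiki-clean.txt"  # hypothetical filename for the cleaned corpus

# Run 1: unigram counts -> top1grams-wiki.txt
subprocess.run(["perl", "psdvec/gramcount.pl", CORPUS, "top1grams-wiki.txt"], check=True)

# Run 2: bigram counts -> top2grams-wiki.txt (a separate run, as noted above)
subprocess.run(["perl", "psdvec/gramcount.pl", CORPUS, "top2grams-wiki.txt"], check=True)

# Factorize the bigram statistics into embeddings such as 25000-180000-500-BLK-8.0.vec
subprocess.run(["python", "psdvec/factorize.py",
                "top1grams-wiki.txt", "top2grams-wiki.txt"], check=True)
```

The only structural point the sketch encodes is the ordering: both gramcount.pl runs must finish before factorize.py consumes their outputs.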
Roger that! Thank you for the quick response & detailed reply!
I noticed that the number of words in "top1grams" is different from the number of words in the word embedding. E.g., for the Wiki dataset, "top1grams" has 286,441 words while the word embedding has 180,000. Does it matter?
It doesn't matter. Words in the word embedding file should be a subset of those in top1grams.txt. Extra words in top1grams.txt will be ignored.
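To illustrate the subset relationship, here is a small check, not part of topicvec, that counts how the two vocabularies overlap. The file formats are assumptions: top1grams-wiki.txt is assumed to start each line with the word, and the embedding file is assumed to be word2vec-style text with a header line.

```python
# Illustrative vocabulary check (not topicvec code); file formats are assumed.

def load_vocab(path, skip_header=False):
    """Collect the first whitespace-separated token of each line."""
    words = set()
    with open(path, encoding="utf-8") as f:
        if skip_header:
            next(f)  # e.g. a "vocab_size dim" header line
        for line in f:
            fields = line.split()
            if fields:
                words.add(fields[0])
    return words

top1_vocab = load_vocab("top1grams-wiki.txt")
emb_vocab = load_vocab("25000-180000-500-BLK-8.0.vec", skip_header=True)

# Extra top1grams words are harmless; embedding words missing from top1grams would not be.
print(len(emb_vocab), "embedding words;", len(top1_vocab), "unigram words;",
      len(emb_vocab - top1_vocab), "embedding words missing from top1grams")
```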
Hi Askerlee, thank you as usual! :) I ran into a problem with "Mstep_sample_topwords" and thought it was caused by the gap between these two counts. It actually turned out to be due to the number of words in the word embedding being smaller than "Mstep_sample_topwords". I fixed it.
Thanks!
I see. I hadn't considered this situation, as I mainly use my own embeddings. Yeah, it's better to fix it. Thanks for finding it.
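For reference, here is a sketch of the kind of guard Gabriele describes. This is hypothetical code, not the actual fix in topicvec: it simply clamps Mstep_sample_topwords to the embedding vocabulary size.

```python
# Hypothetical sketch of the guard described above (not the actual topicvec fix):
# never sample more top words in the M-step than the embedding actually contains.

def cap_mstep_sample_topwords(Mstep_sample_topwords, embedding_vocab_size):
    """Clamp Mstep_sample_topwords to the number of words in the embedding."""
    if Mstep_sample_topwords > embedding_vocab_size:
        print("Warning: Mstep_sample_topwords=%d exceeds embedding vocab size=%d; capping."
              % (Mstep_sample_topwords, embedding_vocab_size))
        return embedding_vocab_size
    return Mstep_sample_topwords

# Example: a 120,000-word embedding with Mstep_sample_topwords set to 180,000
print(cap_mstep_sample_topwords(180000, 120000))  # -> 120000
```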