
If I want to try it on another language, how do I train 25000-180000-500-BLK-8.0.vec.npy? and..

zzks opened this issue on Jan 27 '17 • 6 comments

Hi all,

If I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy and get top1grams-wiki.txt? For example, Chinese: I have a pre-trained w2v model of the Chinese Wikipedia. Can I get these files from that pre-trained model? Thanks!

zzks • Jan 27 '17

top1grams-wiki.txt is generated by a Perl script, https://github.com/askerlee/topicvec/blob/master/psdvec/gramcount.pl. You can generate it using the Chinese Wikipedia text as input. gramcount.pl will also generate top2grams-wiki.txt (two separate runs are needed, one for top1grams* and one for top2grams*). Then you use https://github.com/askerlee/topicvec/blob/master/psdvec/factorize.py to generate 25000-180000-500-BLK-8.0.vec, with both top1grams* and top2grams* as input.

You can find an example in https://github.com/askerlee/topicvec/blob/master/psdvec/PSDVec.pdf.
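For concreteness, here is a minimal sketch of that two-pass pipeline driven from Python. The arguments passed to gramcount.pl and factorize.py below are assumptions, not the scripts' real interfaces, so treat this as the shape of the workflow and check each script's usage message before running:

```python
import subprocess

# CAUTION: the positional arguments below are placeholders -- check the
# actual flags of gramcount.pl and factorize.py before running.
CORPUS = "zhwiki.txt"  # hypothetical plain-text dump of the Chinese Wikipedia

# Two separate runs of gramcount.pl: one producing top1grams-wiki.txt,
# one producing top2grams-wiki.txt.
subprocess.run(["perl", "psdvec/gramcount.pl", CORPUS, "top1grams-wiki.txt"], check=True)
subprocess.run(["perl", "psdvec/gramcount.pl", CORPUS, "top2grams-wiki.txt"], check=True)

# Factorization step: consumes both n-gram files and writes the embedding,
# e.g. 25000-180000-500-BLK-8.0.vec.
subprocess.run(["python", "psdvec/factorize.py",
                "top1grams-wiki.txt", "top2grams-wiki.txt"], check=True)
```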

askerlee • Jan 27 '17

Roger that! Thank you for the quick response & detailed reply!

zzks • Jan 27 '17

I noticed that the number of words in "top1grams" differs from the number of words in the word embedding. E.g., for the Wiki dataset, "top1grams" has 286441 words while the word embedding has 180000. Does it matter?

gabrer • May 10 '17

It doesn't matter. Words in the word embedding file should be a subset of those in top1grams.txt. Extra words in top1grams.txt will be ignored.
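If you want to check this subset relationship for your own files, here is a small sketch. The file names and formats are assumptions: top1grams is taken to be "word count" per line, and the .vec file to be word2vec-style text with an optional "vocab_size dim" header line.

```python
# Sketch: verify that the embedding vocabulary is a subset of top1grams.
def read_first_column(path, skip_numeric_header=False):
    words = set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            parts = line.split()
            if not parts:
                continue
            # skip a header line like "180000 500", if present
            if i == 0 and skip_numeric_header and parts[0].isdigit():
                continue
            words.add(parts[0])
    return words

top1_words = read_first_column("top1grams-wiki.txt")
emb_words = read_first_column("25000-180000-500-BLK-8.0.vec",
                              skip_numeric_header=True)

missing = emb_words - top1_words
print(f"embedding: {len(emb_words)} words, top1grams: {len(top1_words)} words, "
      f"missing from top1grams: {len(missing)}")
```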


askerlee • May 11 '17

Hi Askerlee, thank you as usual! :) I ran into a problem with "Mstep_sample_topwords" and thought it was caused by the gap between these two counts. However, it was actually because the number of words in the word embedding was smaller than "Mstep_sample_topwords". I fixed it.

Thanks!
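For anyone hitting the same error: the fix boils down to capping the sample size at the embedding's vocabulary size. A minimal sketch, with illustrative names rather than topicvec's actual variables:

```python
# Never sample more top words than the embedding vocabulary contains.
def effective_topwords(mstep_sample_topwords, vocab_size):
    return min(mstep_sample_topwords, vocab_size)

# e.g. an embedding with 150000 words but Mstep_sample_topwords = 180000:
print(effective_topwords(180000, 150000))  # -> 150000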

gabrer • May 11 '17

I see. I hadn't considered this situation, as I mainly use my own embeddings. Yeah, it's better to have it fixed. Thanks for finding it.


askerlee • May 11 '17