The format of the output.

Open ayumiymk opened this issue 7 years ago • 6 comments

In .info.full.txt, each line has the format #line #topic id (prob, word)... . But what does the topic id mean? Where can I find the corresponding topic words? I tried looking it up in the vocabulary, using the id as the line number, but that failed. Also, given a document, how do I get its topics? Sorry for these trivial questions, and thank you in advance.

ayumiymk avatar Mar 17 '17 09:03 ayumiymk

The topic id is the id of the topic. You can find the corresponding topic in the corresponding line in .model, which is a row of counts, each of which is the number of occurrences of a word in this topic.

You can get its topics from the file .z.estimate or .z.inference, depending on whether the document is from the training or testing set. These files contain the topic assignment for each word in the documents, so you can simply count them if you want the document-topic counts. If you want the document-topic distribution, you can calculate it as

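\theta_k = \frac{C_k + \alpha}{\sum_{k'=1}^{K} C_{k'} + K\alpha}

where \theta_k is the probability of topic k in the document, C_k is the number of words in the document assigned to topic k, K is the number of topics, and \alpha is the hyper-parameter.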
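As a minimal sketch of that calculation in Python: the function name `doc_topic_distribution` is hypothetical, and the sketch assumes you have already read one document's topic assignments out of .z.estimate or .z.inference into a list of 0-based topic ids. It does not parse those files, since their exact layout is not described here.

```python
from collections import Counter


def doc_topic_distribution(assignments, num_topics, alpha):
    """Smoothed document-topic distribution for a single document.

    assignments: one topic id per word in the document
                 (assumed 0-based here, which may not match the files).
    num_topics:  K, the number of topics.
    alpha:       the hyper-parameter used for smoothing.
    """
    counts = Counter(assignments)  # C_k for every topic that occurs
    denom = len(assignments) + num_topics * alpha  # sum of C_k plus K * alpha
    return [(counts.get(k, 0) + alpha) / denom for k in range(num_topics)]


# Toy document with five words and four topics.
theta = doc_topic_distribution([0, 2, 2, 1, 2], num_topics=4, alpha=0.5)
print(theta)
```

The smoothing term keeps every topic's probability above zero, and the entries of `theta` sum to 1.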

cjf00000 avatar Mar 17 '17 12:03 cjf00000

Thanks.

But I am still confused. For example, the topic id on the first line of train.info.full.txt is 4132, but line 4132 of train.model is 1 49:1. What does that mean?

Thank you very much.

ayumiymk avatar Mar 18 '17 02:03 ayumiymk

It means the topic on line 4132 has only one word, word #49, assigned to it once.

Note, however, that the topic id may start from 0 (I forget whether it starts from 0 or 1), and there is a leading line in train.model. You may wish to inspect line 4134 in train.model instead.

cjf00000 avatar Mar 18 '17 02:03 cjf00000

I see that train.model has one more leading line than .vocab. Do they share the same index? For example, does line 4134 in train.model correspond to the word on line 4133 of .vocab?

As I understand it, I first find the topic id in train.info.full.txt, and then find the corresponding topic on the corresponding line (the topic id +2 or +1?) of .model. Have I understood correctly?

ayumiymk avatar Mar 18 '17 08:03 ayumiymk

Sorry, I forgot that train.model is a (vocabulary size) × (number of topics) sparse matrix.

So you should look at the 4132nd column of train.model instead of the 4132nd row. Each row in train.model is a word, not a topic.
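A rough sketch of reading one topic's column out of such a word-by-topic matrix, in Python. The row layout is an assumption inferred from the sample line `1 49:1` (a count of entries followed by `topic:count` pairs). The helper names are hypothetical, the leading line of train.model is assumed to have been skipped already, and word indices are simply taken as 0-based row positions.

```python
def parse_sparse_row(line):
    """Parse '<nnz> <topic>:<count> ...' into {topic: count} (assumed layout)."""
    fields = line.split()
    nnz = int(fields[0])  # number of topic:count pairs on this row
    entries = {}
    for pair in fields[1:1 + nnz]:
        topic, count = pair.split(":")
        entries[int(topic)] = int(count)
    return entries


def topic_column(rows, topic_id):
    """Collect {word_index: count} for one topic, i.e. one matrix column."""
    column = {}
    for word_index, line in enumerate(rows):
        count = parse_sparse_row(line).get(topic_id, 0)
        if count:
            column[word_index] = count
    return column


# Toy rows, one per word; the header line is assumed to be removed.
rows = ["1 49:1", "2 3:5 49:2", "0"]
col = topic_column(rows, 49)
print(col)
```

Sorting the resulting dictionary by count and mapping the word indices through .vocab would then give the most frequent words for that topic, provided the indexing assumptions above hold.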

cjf00000 avatar Mar 18 '17 11:03 cjf00000

But I only have 100 topics, so train.model has at most 100 columns. How can I get the 4132nd column?

ayumiymk avatar Mar 18 '17 13:03 ayumiymk