
Potential issue

Open drd13 opened this issue 6 years ago • 8 comments

In your readme for the guidedlda module you showed the behaviour of the algorithm on the NYT dataset. I tried running the example code you provided, with the same seeds and parameters, but increasing the LDA's number of iterations from 100 to 1000. Doing this, I obtained very similar topics for the guided and unguided runs.
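Concretely, the only change was raising n_iter in the README example; roughly like this (a sketch following the README's API from memory, with the seed lists abridged for illustration):

```python
import numpy as np
import guidedlda

# Load the bundled NYT example data, as in the README
X = guidedlda.datasets.load_data(guidedlda.datasets.NYT)
vocab = guidedlda.datasets.load_vocab(guidedlda.datasets.NYT)
word2id = dict((v, idx) for idx, v in enumerate(vocab))

# Abridged, illustrative seed lists (the README uses longer ones)
seed_topic_list = [['game', 'team', 'win', 'player', 'season'],
                   ['percent', 'company', 'market', 'price', 'sell'],
                   ['play', 'film', 'movie', 'theater', 'production'],
                   ['official', 'state', 'government', 'political', 'leader']]
seed_topics = {word2id[w]: t_id
               for t_id, words in enumerate(seed_topic_list)
               for w in words if w in word2id}

# Same seeds and parameters as the README, but n_iter raised from 100 to 1000
model = guidedlda.GuidedLDA(n_topics=5, n_iter=1000, random_state=7, refresh=100)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

# Print the top words of each topic
n_top_words = 8
for i, topic_dist in enumerate(model.topic_word_):
    top = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    print('Topic {}: {}'.format(i, ' '.join(top)))
```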

The topics were, for the unguided LDA:

Topic 0: company percent market business price sell executive president
Topic 1: game play team win season player second victory
Topic 2: life play man write woman thing young child
Topic 3: building city place area small house water home
Topic 4: official state government issue case member public political

and for the guided LDA:

Topic 0: game play team win player season second start victory point
Topic 1: company percent market price business sell executive sale buy cost
Topic 2: life play man thing woman write book old young world
Topic 3: official state government issue case political public states member leader
Topic 4: city building police area home house car father live yesterday

These topics are pretty much identical (only the ordering of a few words within the topics differs). This suggests that the algorithm you have implemented, when run to convergence, is identical to regular LDA.

If my understanding is correct, the algorithm described in Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa (2012) is more involved, and requires a change to the generative model and thus to the collapsed Gibbs sampling formula. Your algorithm seems to only be using the seed words for the initialization.

I was wondering if you could shed some light on these issues?

drd13 commented on Mar 02 '18

Sure thing. The example is based on a really small dataset, so it will be hard to see a significant difference.

You are right about the initialisation. As described in the blog post, I have only manipulated the initialisation and let the LDA algorithm do its magic after that.

I didn't want to create topics which don't actually have enough strength to become topics of their own.
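In other words, the seeding happens only when the topic assignments are first created; conceptually it looks something like this (a minimal, hypothetical sketch of the idea with made-up helper names, not the actual GuidedLDA code):

```python
import numpy as np

def seeded_initialization(docs, n_topics, seed_topics, seed_confidence, rng):
    """Assign an initial topic to every token.

    Seeded words are biased towards their seed topic; everything else is
    assigned uniformly at random.

    docs            -- list of documents, each a list of word ids
    seed_topics     -- dict mapping word id -> seed topic id
    seed_confidence -- probability of honouring the seed for a seeded word
    rng             -- e.g. np.random.default_rng(7)
    """
    assignments = []
    for doc in docs:
        doc_z = []
        for w in doc:
            if w in seed_topics and rng.random() < seed_confidence:
                z = seed_topics[w]               # nudge the word towards its seed topic
            else:
                z = int(rng.integers(n_topics))  # ordinary random initialisation
            doc_z.append(z)
        assignments.append(doc_z)
    return assignments
```

After this step, a completely standard collapsed Gibbs sampler runs on the resulting counts, which is why on a small corpus a long enough run can largely wash the seeding out.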

> If my understanding is correct, the algorithm described in Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa (2012) is more involved, and requires a change to the generative model and thus to the collapsed Gibbs sampling formula. Your algorithm seems to only be using the seed words for the initialization.

If you can elaborate more on this part, I will be able to explain better.

vi3k6i5 commented on Mar 02 '18

I am not an expert in latent Dirichlet allocation, so I would be curious to know if you agree with my interpretation.

It seems to me, from your README file, that you are treating the algorithm you implemented and the algorithm in the "Incorporating Lexical Priors into Topic Models" paper as identical.

The modified algorithm in the paper updates the generative process of LDA to incorporate the seeded words. Any such modification to the generative process leads to a modification of the collapsed Gibbs sampling formula, because the full conditional being sampled is derived from the generative model.

In your implementation, since you do not modify the collapsed Gibbs sampling, you do not actually modify the generative process of the algorithm. To explain the difference with an analogy: if you think of LDA as an algorithm trying to minimise a function, your implementation modifies the starting point (the initialisation here) from which the algorithm searches for the minimum, while the implementation in the paper modifies the function being minimised.
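For concreteness, plain LDA's collapsed Gibbs sampler draws each token's topic from the standard full conditional (textbook LDA notation, not code from either repository), where n_{d,k} are document-topic counts, n_{k,w} are topic-word counts, and the superscript -i means the current token is excluded:

```latex
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\left(n_{d,k}^{-i} + \alpha\right)
\frac{n_{k,w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta}
```

Seeding only the initialisation changes the starting counts fed into this expression; a seeded generative model, as in the paper, changes the expression itself.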

If LDA does sometimes converge to local minima, then your algorithm could be interesting as a way to steer towards desired local minima. Otherwise, it could be useful for quicker convergence. But, if my understanding of LDA is correct, the guiding in your implementation is considerably weaker than that of the algorithm in the paper.

drd13 commented on Mar 03 '18

I was reading up on collapsed Gibbs sampling and I would also not call myself an expert. But anyway, collapsed Gibbs sampling is present in a lot of LDA implementations. As far as I know, the GuidedLDA code and the base LDA code it was built on are also built on collapsed Gibbs sampling.

The computation is done with three count matrices: ndz, nzw and nz. The same approach is explained in the standard derivation of collapsed Gibbs sampling.
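For reference, a single sweep of that sampler over the three count matrices looks roughly like this (a plain-Python sketch of the standard algorithm, not the GuidedLDA Cython code):

```python
import numpy as np

def gibbs_sweep(docs, z, ndz, nzw, nz, alpha, beta, rng):
    """One pass of standard collapsed Gibbs sampling for LDA.

    ndz -- document-topic counts, shape (n_docs, n_topics)
    nzw -- topic-word counts,     shape (n_topics, vocab_size)
    nz  -- per-topic totals,      shape (n_topics,)
    z   -- current topic assignment of every token, z[d][i]
    """
    n_topics, vocab_size = nzw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # remove the current assignment from the counts
            ndz[d, k] -= 1
            nzw[k, w] -= 1
            nz[k] -= 1
            # full conditional: (ndz + alpha) * (nzw + beta) / (nz + V * beta)
            p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + vocab_size * beta)
            k = rng.choice(n_topics, p=p / p.sum())
            # record the new assignment and restore the counts
            z[d][i] = k
            ndz[d, k] += 1
            nzw[k, w] += 1
            nz[k] += 1
```

The seed words only affect how z, ndz, nzw and nz are initialised before the first sweep; the sweep itself is unchanged.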

I might be wrong though. Please let me know if that is the case.

Thanks :)

vi3k6i5 commented on Mar 03 '18

The issue is that the mathematical expression for the collapsed Gibbs sampling is dependent on the generative process of the data. This can easily be seen from the repository associated with the paper, which has a different expression for the Gibbs sampling: https://github.com/bsou/cl2_project/tree/master/SeededLDA

drd13 commented on Mar 07 '18

Yes, true. The approaches used in the two cases are different. We took our own approach.

Hence I said it's based on the paper, but not a complete implementation.

Please share any information that I can read to learn more about this. I haven't worked on this project in a long time. And honestly, all the deep learning projects are working so much better that I don't think these projects will have much of a future. They have a present, but probably not for long.

Happy to improve on this project in my spare time, though. Please do share whatever information you have on the other approach.

vi3k6i5 commented on Mar 07 '18

> And honestly, all the deep learning projects are working so much better that I don't think these projects will have much of a future.

Is there a neural net alternative to LDA? I haven't found anything comparable to unsupervised topic modelling from the deep learning community but perhaps I'm missing something.

dldx commented on Mar 23 '18

Only labelling the data manually and building a supervised classifier, as far as I know. There is a word2vecLDA, but I don't think that project is maintained anymore.

vi3k6i5 commented on Mar 23 '18

Ah, okay. That's what I thought :) Thanks for the reply!

dldx commented on Mar 23 '18