
Seed co-presence per document.

gam-ba opened this issue 6 years ago • 4 comments

Hello, Vikash.

To begin with, thanks for this excellent work. GuidedLDA is a really helpful and sharp tool for unsupervised "label propagation".

I'm not sure if it's really an issue, but I was wondering whether there was any way of weighting seed-term co-presence in documents. I'm working on a rather small corpus (~60,000 short comments from a change.org petition) where most of the comments mix at least two of the seeded topics.

However, when fitting the GuidedLDA model, it seems to assign the topic based on the first seed term appearing in the document. This isn't a problem per se, since we can retrieve the assignment values per topic for each comment...

But here's the thing: the algorithm assigns the comment to the topic of its first seed term with a value of 0.9, when I would expect a much weaker assignment given the co-presence of seed terms.
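To make the observation concrete, here's a minimal sketch of the kind of distribution being described; all numbers are invented, and `is_dominated` is a hypothetical helper, not part of GuidedLDA:

```python
# Hypothetical doc-topic distributions, as you might get from model.transform(X).
# All values are invented for illustration.
doc_topic = [
    [0.90, 0.05, 0.05],  # labelled almost entirely as the first seed's topic
    [0.45, 0.40, 0.15],  # what a genuinely mixed comment would look like
]

def is_dominated(dist, threshold=0.8):
    """Return True if a single topic absorbs nearly all the probability mass."""
    return max(dist) >= threshold

flags = [is_dominated(d) for d in doc_topic]
print(flags)  # [True, False]
```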

Is there any way to consider this?

I'm thinking of something like the doc_topic_prior parameter in scikit-learn's LDA implementation, which corresponds to LDA's alpha parameter.

Again, thank you very much!

Guido

gam-ba avatar Nov 17 '17 13:11 gam-ba

@gam-ba What is the seed_confidence value you are using at the fit step?

model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
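For context, `seed_topics` here is a dict mapping vocabulary word IDs to topic IDs. A minimal sketch of how it is typically built before the `fit` call above; the vocabulary and seed words are invented for illustration:

```python
# Toy vocabulary and seed word lists (invented for illustration).
vocab = ["petition", "climate", "energy", "health", "hospital", "sign"]
word2id = {word: i for i, word in enumerate(vocab)}

seed_topic_list = [
    ["climate", "energy"],   # seeds for topic 0
    ["health", "hospital"],  # seeds for topic 1
]

# seed_topics maps word ID -> topic ID, the format model.fit() expects.
seed_topics = {}
for topic_id, words in enumerate(seed_topic_list):
    for word in words:
        if word in word2id:
            seed_topics[word2id[word]] = topic_id

print(seed_topics)  # {1: 0, 2: 0, 3: 1, 4: 1}
```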

vi3k6i5 avatar Nov 17 '17 13:11 vi3k6i5

I've been using values ranging from 0.2 to 0.01. At least for my corpus, the lower the value, the better the result. That's expected, right?

gam-ba avatar Nov 17 '17 14:11 gam-ba

Ideally, if you are getting good results with lower values of seed_confidence, then you should also try without seeding.

Try the other fit method and see how that works for you.

model.fit(X)

Let me know how that goes, then we can decide how to handle seeding (or whether it's even required) :)

PS: Email me if that's ok with you.

vi3k6i5 avatar Nov 17 '17 14:11 vi3k6i5

@gam-ba @vi3k6i5 did you have any success in finding a solution to your question? I have actually come across the same issue that you have described and would like to see if you could provide some insight. Thanks!

nickkimer avatar Dec 16 '18 05:12 nickkimer