seededlda
Add goodness of fit metrics
I would like to know if there is any implementation of the standard goodness-of-fit metrics for your textmodel_lda class. For instance, here is a SO post which didn't get much traction. I am wondering whether, in the case of seeded LDA, the standard metrics still apply.
Could you please give me some information about any upcoming implementation, if any? Or could you suggest a direct method that applies to your object class? Thanks!
Thank you for the post.
I did not think users of seeded LDA should worry about model fit because the number of topics is theoretically determined, but they might need a way to determine k for unseeded LDA. I still have to do research on how to compute perplexity, but a divergence measure is straightforward, as below. According to this statistic, k should be around 10.
require(seededlda)
require(quanteda)
require(Matrix)

data("data_corpus_moviereviews", package = "quanteda.textmodels")
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords('en'), min_nchar = 2) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")

# mean pairwise KL divergence between topic-word distributions for each k
for (k in seq(5, 50, 5)) {
    lda <- textmodel_lda(head(dfmt, 450), k)
    div <- proxyC::dist(lda$phi, method = "kullback")
    diag(div) <- NA
    fit <- mean(div, na.rm = TRUE)
    cat(k, fit, "\n")
}
5 3.456055
10 3.487706
15 3.453193
20 3.413442
25 3.312763
30 3.221957
35 3.166234
40 3.076886
45 2.98464
50 2.935299
Deveaud used the Jensen-Shannon divergence, but I am using Kullback-Leibler here because proxyC does not have that measure (I probably should add it).
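For reference, a Jensen-Shannon version can be computed directly from phi without proxyC; the js_divergence() helper below is only a sketch, not a package function.

js_divergence <- function(phi) {
    # KL divergence of p from q, treating 0 * log(0 / q) as 0
    kl <- function(p, q) sum(ifelse(p > 0, p * log(p / q), 0))
    k <- nrow(phi)
    div <- matrix(NA_real_, k, k)
    for (i in seq_len(k)) {
        for (j in seq_len(k)) {
            if (i == j) next
            m <- (phi[i, ] + phi[j, ]) / 2       # mixture of the two topic-word distributions
            div[i, j] <- 0.5 * kl(phi[i, ], m) + 0.5 * kl(phi[j, ], m)
        }
    }
    mean(div, na.rm = TRUE)                      # mean pairwise divergence between topics
}
# e.g. js_divergence(lda$phi)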
I can make changes to the LDA functions to return the divergence measure if it is desired, but there is no guarantee that the topics are most meaningful when the statistic is highest.
Thanks for your answer. Here are a few thoughts.
I did not think users of seeded LDA should worry about model fit because the number of topics is theoretically determined, but they might need a way to determine k for unseeded LDA.
I agree in principle. My line of thinking is as follows. Even if it is true that seeded LDA returns a pre-specified number of topics, that specification is quite subjective, since I am the one deciding which keywords the model should attend to. Suppose instead that I come up with an evident misspecification of the topics. How would I know that phi and theta contain estimates that are wrong by construction?
For instance, can we use some form of residual estimation as a robustness check of the efficacy of the model?
Deveaud used the Jensen-Shannon divergence, but I am using Kullback-Leibler here because proxyC does not have that measure (I probably should add it).
That's absolutely fine. KL divergence works just fine.
I can make changes to the LDA functions to return the divergence measure if it is desired, but there is no guarantee that the topics are most meaningful when the statistic is highest.
Totally agree on this one, but then again, how would I know how coherent (i.e., interpretable) my topics are?
I am well aware of the problems around optimal topic identification in LDA. As a matter of fact, we implemented a method to assess optimality which uses a simple chi-square test instead of adopting the perplexity index as a robust goodness-of-fit metric.
I guess the big question is whether or not we can compute any measure of likelihood from your model. If the answer is no, then the approach is basically not testable in terms of its explanatory power. If the answer is yes, then the question is how we can compute it to arrive at a metric comparable to perplexity.
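To make the question concrete, here is the kind of computation I have in mind, just a rough sketch of in-sample perplexity from theta and phi; perplexity() is a hypothetical helper, not something the package exposes.

perplexity <- function(lda, dfmt) {
    prob <- lda$theta %*% lda$phi           # p(word | document), documents x features
    x <- as.matrix(dfmt)[, colnames(prob)]  # observed counts, same column order as prob
    exp(-sum(x * log(prob)) / sum(x))       # exp of negative mean log-likelihood per token
}
# e.g. perplexity(lda, head(dfmt, 450)) for a model fitted on head(dfmt, 450)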
Sorry for being verbose but I think the concern is real and has to be addressed. Thoughts?
Thank you for the link to your project. I will read the paper.
I understand that users are always unsure about their choice of seed words, so I am willing to offer some indicator. The divergence measure is easy to add via a new function, divergence() or something similar.
It is nice to offer the likelihood of parameters (e.g. perplexity), but we can do something similar by re-training an existing model on new data and comparing the old and new models. If the old model has a good fit, the topic-word distribution should not change much when trained on the new data.
In this example, when k = 10, the KL divergence between topics in the old model is higher, but it is smaller between the corresponding topics in the old and new models. These suggest that k = 10 is better than k = 20.
require(seededlda)
require(quanteda)
require(Matrix)

data("data_corpus_moviereviews", package = "quanteda.textmodels")
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords('en'), min_nchar = 2) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")

# mean pairwise KL divergence between topic-word distributions
divergence <- function(x) {
    div <- proxyC::dist(x$phi, method = "kullback")
    diag(div) <- NA
    mean(div, na.rm = TRUE)
}
lda10a <- textmodel_lda(head(dfmt, 450), 10)
lda20a <- textmodel_lda(head(dfmt, 450), 20)
divergence(lda10a)
#> [1] 3.538062
divergence(lda20a)
#> [1] 3.379323
lda10b <- textmodel_lda(tail(dfmt, 50), model = lda10a)
#> Warning: k, alpha and beta values are overwriten by the fitted model
lda20b <- textmodel_lda(tail(dfmt, 50), model = lda20a)
#> Warning: k, alpha and beta values are overwriten by the fitted model
mean(diag(proxyC::dist(lda10a$phi, lda10b$phi, method = "kullback", diag = TRUE)))
#> [1] 0.01251931
mean(diag(proxyC::dist(lda20a$phi, lda20b$phi, method = "kullback", diag = TRUE)))
#> [1] 0.01661152
What do you think?
Thanks again for the support. I like the approach, though, as we agreed, it isn't always the case that the better model happens to have the higher goodness-of-fit metric. If you could add a function like divergence() to the package, that would of course be great.
At this point, my feeling is that, in the absence of a statistical test, we should include other metrics along the lines of topic coherence or similar. topicmodels does not seem to ship one, but I think it is one of the most reliable ways to assess the interpretability of the topics on top of their statistical soundness. I am sure you are familiar with the metric, but here is the paper that introduced it.
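For reference, my understanding of that measure for a single topic is, roughly, the sum over pairs of top words of log((co-document frequency + 1) / document frequency). Below is a rough sketch with quanteda objects; umass_coherence() is just an illustrative name, not the topicdoc implementation.

umass_coherence <- function(top_words, dfmt) {
    x <- dfm_weight(dfmt[, top_words], scheme = "boolean")  # presence/absence of each top word
    df <- docfreq(x)                                        # D(v): document frequency
    co <- as.matrix(Matrix::crossprod(x))                   # D(v, v'): co-document frequency
    score <- 0
    for (i in 2:length(top_words)) {
        for (j in seq_len(i - 1)) {
            score <- score + log((co[i, j] + 1) / df[j])    # lower-ranked word conditioned on higher-ranked word
        }
    }
    score
}
# e.g. umass_coherence(terms(lda, 10)[, 1], head(dfmt, 450)) for topic 1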
The package topicdoc (which seems to have fallen behind in terms of development) does appear to implement a way to compute topic coherence on objects estimated with topicmodels. The function is called topic_coherence().
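For what it's worth, usage looks roughly like the following; the argument names are from memory, so the topicdoc documentation should be checked, and the conversion step assumes the dfmt object from above.

library(topicmodels)
library(topicdoc)

# drop empty documents, then convert the quanteda dfm to a tm DocumentTermMatrix
dtm <- convert(dfm_subset(dfmt, ntoken(dfmt) > 0), to = "topicmodels")
lda_tm <- LDA(dtm, k = 10, control = list(seed = 1234))
topic_coherence(lda_tm, dtm, top_n_tokens = 10)   # coherence score per topic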
Thoughts?
I will add divergence(x) to the package first. Then I would also add coherence(). Whether the two measures agree with each other on the optimal k is a question that we should study empirically.
That sounds great, thank you very much!
Regarding the agreement of the two measures, well, that's an interesting question which demands an answer.
This is my coherence function, but the statistic only gets lower as k gets higher...
# coherence summed over all topics, using each topic's top n terms
coherence <- function(x, n = 10) {
    h <- apply(terms(x, n), 2, function(y) {
        d <- x$data[, y]   # dfm restricted to the topic's top n words
        # e[i, j]: document frequency of word i (recycled column-wise)
        e <- Matrix::Matrix(docfreq(d), nrow = nfeat(d), ncol = nfeat(d))
        # f[i, j]: number of documents in which words i and j both occur, smoothed by 1
        f <- fcm(d, count = "boolean") + 1
        # strict upper triangle of log(f / e)
        g <- Matrix::band(log(f / e), 1, ncol(f))
        sum(g)
    })
    sum(h)
}
for (k in seq(5, 50, 5)) {
    lda <- textmodel_lda(head(dfmt, 450), k)
    coh <- coherence(lda)
    cat(k, coh, "\n")
}
5 -433.7141
10 -895.6645
15 -1329.028
20 -1746.271
25 -2126.106
30 -2503.134
35 -2987.627
40 -3312.912
45 -3686.489
50 -3957.752
That's weird. Topic coherence should increase with k, not the other way around. What am I missing here?
It could be my mistake, but I don't know what is wrong in my code.
Hi, I'm following your code above, but proxyC throws an error, as below:
k <- 5
lda <- textmodel_lda(head(dfmt, 450), k)
div <- proxyC::dist(lda$phi, method = "kullback")
Error in proxy(x, y, margin, method, p = p, smooth = smooth, drop0 = drop0, :
  x must be a sparseMatrix
Are there any conditions for using proxyC::dist()? Thanks.
You need the latest proxyC. Fixed via 8190c46.
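(For anyone else hitting this: updating proxyC should be enough; something like the following, where the GitHub install is only needed if the CRAN release does not yet contain the fix.)

install.packages("proxyC")
# or the development version, if the CRAN release is not recent enough:
# remotes::install_github("koheiw/proxyC")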
It works. Thank you.