textacy
Extracted topics make no sense; might have something to do with Unicode
Hi,
I've just installed the latest version of textacy in Python 2.7 on a Mac. I am trying to extract topics from a set of comments that do have quite a few non-ASCII characters. The topics I am getting make no sense.
Here's what's going on. I create a corpus of comments like this:
corpus = textacy.Corpus('en', texts=the_data)
This creates a Corpus(3118 docs; 71018 tokens). If I print out the first three documents from the corpus, they look normal:
[Doc(45 tokens; "verrrrry slow pharmacy staff-pharmacist was wai..."),
Doc(17 tokens; "prices could be a bit lower. service desk could..."),
Doc(11 tokens; "i got what i wanted at the price i wanted.")]
Then:
vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True, smooth_idf=True,
                                min_df=2, max_df=0.95)
doc_term_matrix = vectorizer.fit_transform(
    (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
     for doc in corpus))
# initialize and train topic model
model = textacy.tm.TopicModel('nmf', n_topics=10)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
    print('topic', topic_idx, ':', ' '.join(top_terms))
And that's where I get back "topics" that make no sense:
(u'topic', 0, u':', u"be 's p.m. -PRON- because will would have not")
(u'topic', 1, u':', u"not p.m. because 's -PRON- will would have be")
(u'topic', 2, u':', u"'s p.m. -PRON- because will would have not be")
(u'topic', 3, u':', u"'s p.m. -PRON- because will would have not be")
(u'topic', 4, u':', u"'s p.m. -PRON- because will would have not be")
(u'topic', 5, u':', u"have 's p.m. -PRON- because will would not be")
(u'topic', 6, u':', u"'s p.m. -PRON- because will would have not be")
(u'topic', 7, u':', u"will 's p.m. -PRON- because would have not be")
(u'topic', 8, u':', u"would 's p.m. -PRON- because will have not be")
(u'topic', 9, u':', u"'s p.m. -PRON- because will would have not be")
Somehow, the fact that everything comes out with a u prefix seems to indicate to me that Unicode strings are potentially messing things up, but I am not sure how to fix that. The printed corpus seemed perfectly fine.
Could you please help? Thanks a lot!
Hey @hedgy123, it's hard for me to tell what's going wrong here, but since your code looks correct, I'm guessing that the garbage topics result from some combination of problems with the data, the term normalization, and the parameters for the topic model being trained.
Here are a few things to try:
- Confirm that there aren't duplicates in your training data, since that has been known to negatively affect topic model outputs.
- Don't lemmatize your terms in doc.to_terms_list(), by specifying either normalize='lower' or normalize=False (see the sketch after this list).
- Try a different model type, i.e. 'lda' or 'lsa'.
- Try varying your topic model's n_topics, both higher and lower.
- Try increasing your max_iter, in case the model is simply failing to converge.
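To make that concrete, here's a rough sketch reusing your the_data variable. It assumes the same API as your snippet above, i.e. that to_terms_list() accepts the normalize parameter and that TopicModel forwards extra keyword args like max_iter to the underlying scikit-learn model:

# Rough sketch of the suggestions above (assumptions noted in the lead-in).
from collections import OrderedDict

import textacy

# 1) drop exact-duplicate comments before building the corpus
unique_texts = list(OrderedDict.fromkeys(the_data))
corpus = textacy.Corpus('en', texts=unique_texts)

# 2) lowercase terms instead of lemmatizing them
vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True,
                                smooth_idf=True, min_df=2, max_df=0.95)
doc_term_matrix = vectorizer.fit_transform(
    (doc.to_terms_list(ngrams=1, named_entities=True,
                       normalize='lower',  # or normalize=False
                       as_strings=True)
     for doc in corpus))

# 3) different model type, more topics, more iterations
model = textacy.tm.TopicModel('lda', n_topics=20, max_iter=500)
model.fit(doc_term_matrix)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
    print('topic', topic_idx, ':', ' '.join(top_terms))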
If none of that works, I'd assume that either your corpus isn't conducive to topic modeling (ugh for you) or there's a bug somewhere in textacy (ugh for me). Please let me know how your experiments go!
Off the top of my head, I think it might be some escaping/formatting issues related to ' and ", because of this structure: (u'topic', 0, u':', u"be 's ...
It's worth trying to escape them properly, or remove them altogether from your raw data before passing it to doc.to_terms_list().
This might help if you want to escape them and keep the punctuation: https://stackoverflow.com/questions/18935754/how-to-escape-special-characters-of-a-string-with-single-backslashes
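For example, something like this would remove them entirely before the corpus is built (purely illustrative; the helper name and the quote-character set are just placeholders, adjust them to whatever shows up in your comments):

# Illustrative only: strip straight and curly quotes from the raw comments
# before building the corpus / calling doc.to_terms_list().
import textacy

QUOTE_CHARS = u"'\"\u2018\u2019\u201c\u201d"

def strip_quotes(text):
    # keep every character that is not in the quote set
    return u''.join(ch for ch in text if ch not in QUOTE_CHARS)

cleaned_data = [strip_quotes(comment) for comment in the_data]
corpus = textacy.Corpus('en', texts=cleaned_data)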
@hedgy123 I get the same issue; the topics do not make sense. Did you figure out what the problem was?
Ran into the same issue here 🤔
Okay, sounds like I should confirm whether the topic models' behavior is expected... I've been punting on major textacy development while I wait for the official spacy v2 release, but this issue is probably independent of that. Will let y'all know if I find anything.