
Extracted topics make no sense; might have something to do with unicodes

Open · hedgy123 opened this issue on Sep 26, 2017 · 5 comments

Hi,

I've just installed the latest version of textacy in python 2.7 on a Mac. I am trying to extract topics from a set of comments that do have quite a few non-ASCII characters. The topics I am getting make no sense.

Here's what's going on. I create a corpus of comments like this:

    corpus = textacy.Corpus('en',texts=the_data)

This creates a Corpus(3118 docs; 71018 tokens). If I print out the first three documents from the corpus, they look normal:

    [Doc(45 tokens; "verrrrry slow pharmacy staff-pharmacist was wai..."),
    Doc(17 tokens; "prices could be a bit lower. service desk could..."),
    Doc(11 tokens; "i got what i wanted at the price i wanted.")]

Then:

    vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True, smooth_idf=True,
                                    min_df=2, max_df=0.95)
    doc_term_matrix = vectorizer.fit_transform(
        (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
         for doc in corpus))

    # initialize and train topic model
    model = textacy.tm.TopicModel('nmf', n_topics=10)
    model.fit(doc_term_matrix)
    doc_topic_matrix = model.transform(doc_term_matrix)

    for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
        print('topic', topic_idx, ':', '   '.join(top_terms))

And that's where I get back "topics" that make no sense:

  (u'topic', 0, u':', u"be   's   p.m.   -PRON-   because   will   would   have   not")
  (u'topic', 1, u':', u"not   p.m.   because   's   -PRON-   will   would   have   be")
  (u'topic', 2, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
  (u'topic', 3, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
  (u'topic', 4, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
  (u'topic', 5, u':', u"have   's   p.m.   -PRON-   because   will   would   not   be")
  (u'topic', 6, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
  (u'topic', 7, u':', u"will   's   p.m.   -PRON-   because   would   have   not   be")
  (u'topic', 8, u':', u"would   's   p.m.   -PRON-   because   will   have   not   be")
  (u'topic', 9, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")

The fact that everything comes out with u'...' prefixes seems to indicate to me that Unicode handling is potentially messing things up, but I am not sure how to fix that. The printed corpus looked perfectly fine.
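(For reference, the tuple-style formatting itself comes from Python 2's print statement treating the comma-separated arguments as a tuple, which is where the parentheses and the u'' prefixes in the output come from; a minimal snippet reproducing just the formatting:)

    # In Python 2, print with comma-separated arguments displays a tuple,
    # hence the parentheses and the u'' prefixes in the output above
    print(u'topic', 0, u':', u"be   's   p.m.")
    # (u'topic', 0, u':', u"be   's   p.m.")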

Could you please help? Thanks a lot!

hedgy123 · Sep 26 '17

Hey @hedgy123, it's hard for me to tell what's going wrong here, but since your code looks correct, I'm guessing the garbage topics result from some combination of problems with the data, the term normalization, and the parameters of the topic model being trained.

Here are a few things to try:

  1. Confirm that there aren't duplicates in your training data, since that has been known to negatively affect topic model outputs.
  2. Don't lemmatize your terms in doc.to_terms_list(); specify either normalize='lower' or normalize=False instead.
  3. Try a different model type, e.g. 'lda' or 'lsa'. Try varying your topic model's n_topics, both higher and lower. Try increasing max_iter, in case the model is simply failing to converge. (Rough sketch below.)
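Roughly, combining 2 and 3 with your snippet above (an untested sketch; I'm assuming to_terms_list accepts normalize and that TopicModel passes extra keyword args like max_iter through to the underlying scikit-learn model):

    # lowercase terms instead of lemmatizing them
    doc_term_matrix = vectorizer.fit_transform(
        (doc.to_terms_list(ngrams=1, named_entities=True,
                           normalize='lower', as_strings=True)
         for doc in corpus))

    # different model type, fewer topics, more iterations
    model = textacy.tm.TopicModel('lda', n_topics=5, max_iter=500)
    model.fit(doc_term_matrix)
    doc_topic_matrix = model.transform(doc_term_matrix)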

If none of that works, I'd assume that either your corpus isn't conducive to topic modeling (ugh for you) or there's a bug somewhere in textacy (ugh for me). Please let me know how your experiments go!

bdewilde · Sep 26 '17

Off the top of my head, I think it might be an escaping/formatting issue related to ' and ", because of this structure: (u'topic', 0, u':', u"be 's …

It's worth trying to escape them properly or remove them altogether from your raw data before pushing it to doc.to_terms_list().

This might help to escape them if you want to keep the punctuation: https://stackoverflow.com/questions/18935754/how-to-escape-special-characters-of-a-string-with-single-backslashes
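If you'd rather just drop them, something quick and dirty like this before building the corpus might be worth a shot (a rough sketch; it assumes the_data is the list of raw comment strings from the original post):

    import re
    import textacy

    # strip straight and curly quotes/apostrophes from the raw comments
    QUOTE_RE = re.compile(u"[\"'\u2018\u2019\u201c\u201d]")
    cleaned_data = [QUOTE_RE.sub(u"", text) for text in the_data]

    corpus = textacy.Corpus('en', texts=cleaned_data)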

LeonardoReyes · Sep 27 '17

@hedgy123 I get the same issue, the topics do not make sense. Did you figure out what the problem was?

anamariakantar · Oct 02 '17

Ran into the same issue here 🤔

lyons422 · Oct 04 '17

Okay, sounds like I should confirm whether this topic-model behavior is expected... I've been punting on major textacy development while I wait for the official spaCy v2 release, but this issue is probably independent of that. Will let y'all know if I find anything.

bdewilde · Oct 05 '17