
Topic Modelling and Visualization

Open mk2510 opened this issue 3 years ago • 12 comments

This PR implements support for Topic Modelling in Texthero (see #42). Maybe see the showcasing notebook first before reading this.

Overview

We implement 6 new functions:

  • lda (Latent Dirichlet Allocation)
  • truncatedSVD (truncated Singular Value Decomposition), same as Latent Semantic Analysis / Indexing (LSA / LSI)
  • visualize_topics to visualize topics with pyLDAvis
  • topics_from_topic_model to get topics for documents after using lda/tSVD
  • top_words_per_document to get the most relevant words ("keywords") for every document
  • top_words_per_topic to get the most relevant words for every topic (=cluster)

There are now two main ways for users to find, visualize, and understand the topics in their datasets:

  1. tfidf/count/term_frequency [optional: -> flair embeddings] [optional: -> dimensionality reduction, tSVD] -> clustering. The clusters are then understood as "topics". Users can e.g. call visualize_topics(s_tfidf, s_clustered) to see their clusters/topics visualized, and top_words_per_topic(s_tfidf, s_clustered) to get the most relevant words for every cluster (both routes are sketched below after this list).

  2. tfidf/count/term_frequency -> lda. Users can e.g. call visualize_topics(s_tfidf, s_lda) to see the topics found by lda visualized, do s_topics = topics_from_topic_model(s_lda) to get the best-matching topic for every document, and then call top_words_per_topic(s_tfidf, s_topics) to get the most relevant words for every topic.
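To make the two routes concrete, here is a minimal sketch of both (toy corpus for illustration; the pipeline combines Texthero's existing clean/tokenize/tfidf/kmeans with the function names introduced in this PR, so treat it as a sketch rather than the exact final API):

>>> import texthero as hero
>>> import pandas as pd
>>> # Hypothetical toy corpus.
>>> s = pd.Series(["football is great", "python is great", "football and python"])
>>> s_tfidf = s.pipe(hero.clean).pipe(hero.tokenize).pipe(hero.tfidf)
>>> # Route 1: clustering -> clusters are the topics.
>>> s_clustered = s_tfidf.pipe(hero.kmeans, n_clusters=2)
>>> hero.visualize_topics(s_tfidf, s_clustered)  # doctest: +SKIP
>>> hero.top_words_per_topic(s_tfidf, s_clustered)  # doctest: +SKIP
>>> # Route 2: lda -> document-topic matrix -> best topic per document.
>>> s_lda = s_tfidf.pipe(hero.lda, n_components=2)
>>> s_topics = hero.topics_from_topic_model(s_lda)
>>> hero.top_words_per_topic(s_tfidf, s_topics)  # doctest: +SKIP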

The new functions in detail (excerpts of their docstrings + some explanations)

LDA

lda(s: Union[VectorSeries, DocumentTermDF], n_components=10, max_iter=10, random_state=None, n_jobs=-1) -> VectorSeries

This is a very straightforward wrapper around sklearn's LDA. LDA returns a matrix of dimensions (number of documents) × (number of topics), the "document-topic matrix", that relates documents to topics: document_topic_matrix[i][j] says how strongly document i belongs to topic j (unnormalized!).
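For illustration, a minimal sketch of what such a wrapper could look like (not the exact PR code; the signature follows the docstring above, and we assume a VectorSeries input, i.e. one term-weight vector per document):

import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation

def lda(s: pd.Series, n_components=10, max_iter=10, random_state=None, n_jobs=-1) -> pd.Series:
    # Stack the one-vector-per-document Series into an (n_documents, n_terms) matrix.
    X = list(s)
    model = LatentDirichletAllocation(
        n_components=n_components, max_iter=max_iter,
        random_state=random_state, n_jobs=n_jobs)
    # fit_transform returns the document-topic matrix relating documents to topics.
    document_topic_matrix = model.fit_transform(X)
    # Return one topic-weight vector per document, keeping the original index.
    return pd.Series(list(document_topic_matrix), index=s.index)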

truncatedSVD

Similar to e.g. PCA; see this for an example of using the sklearn implementation, which is what this function wraps. Usage is analogous to PCA.
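A short sketch of the underlying sklearn usage (toy corpus; this is plain sklearn, not the PR's wrapper):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["football is great", "python is great", "football and python"]
X = TfidfVectorizer().fit_transform(corpus)          # sparse document-term matrix
svd = TruncatedSVD(n_components=2, random_state=42)  # LSA with 2 latent topics
X_reduced = svd.fit_transform(X)                     # shape: (n_documents, n_components)
print(X_reduced.shape)  # (3, 2)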

visualize_topics

visualize_topics(s_document_term: DocumentTermDF, s_document_topic: Union[VectorSeries, CategorySeries (issue 164)], show_in_new_window=False, return_figure=False)

This is our coolest new function; it visualizes the topics interactively. It builds upon pyLDAvis, extended so that the great visualization interface is not restricted to LDA output.

The first input is the output of tfidf/term_frequency/count. This gives us a relation (/matrix) document->terms. The second input has to give us a relation document->topic. This can either be the output of one of our clustering functions (then the clusters are the topics, so we have one topic per document; we create a document-topic-matrix from that) or of lda (then as described above in lda, we have a document-topic-matrix right there already).

From those two relations (documents->topics, documents->terms), the function calculates a distribution of documents over topics and a distribution of topics over terms (similar to what pyLDAvis does internally, but extended to support clustering input and not only LDA). These distributions are then passed to pyLDAvis, which visualizes them. The function visualize_topics and its helper functions are really well documented :2nd_place_medal:, so it should be clear what's happening in the code after reading this.
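To illustrate the hand-off to pyLDAvis, here is a sketch of the two distributions and the call to pyLDAvis' public prepare function (toy data and our own variable names; the real function derives these from the DocumentTermDF and the document-topic input):

import numpy as np
import pyLDAvis

# Toy data: 3 documents, 4 terms, 2 topics (hypothetical values).
document_term = np.array([[2, 1, 0, 0],
                          [0, 1, 2, 0],
                          [1, 0, 1, 2]])
vocabulary = ["football", "great", "python", "code"]
document_topic = np.array([[1, 0], [0, 1], [0, 1]])  # e.g. one-hot clusters

# Row-normalize to get the two distributions pyLDAvis expects.
doc_topic_dists = document_topic / document_topic.sum(axis=1, keepdims=True)
topic_term = doc_topic_dists.T @ document_term  # relate topics to terms
topic_term_dists = topic_term / topic_term.sum(axis=1, keepdims=True)

figure = pyLDAvis.prepare(
    topic_term_dists=topic_term_dists,
    doc_topic_dists=doc_topic_dists,
    doc_lengths=document_term.sum(axis=1),
    vocab=vocabulary,
    term_frequency=document_term.sum(axis=0),
    R=4,  # display at most 4 terms, as the toy vocabulary is tiny
)
pyLDAvis.show(figure)  # or pyLDAvis.display(figure) in a notebook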

topics_from_topic_model

topics_from_topic_model(s_document_topic: VectorSeries) -> CategorySeries (issue 164)

Find the topics from a topic model. The input has to be the output of lda or truncatedSVD, i.e. of one of Texthero's Topic Modelling functions that return a relation between documents and topics (the document_topic_matrix). The function uses the given relation of documents to topics to calculate the best-matching topic per document and returns a Series with the topic IDs.

The document_topic_matrix relates documents to topics: it shows for each document (each row) how strongly that document belongs to each topic, so document_topic_matrix[X][Y] = how strongly document X belongs to topic Y (as explained above). We use np.argmax on each row to find the index of the topic that the document belongs to most strongly. E.g. when the first row of the document_topic_matrix is [0.2, 0.1, 0.2, 0.5], the first document is put into topic/cluster 3, as the entry at index 3 (counting from 0) has the highest value.

We return a CategorySeries (see #164), i.e. a Series with an ID per document describing which cluster it belongs to.
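A minimal sketch of that argmax step (toy data for illustration):

import numpy as np
import pandas as pd

# Toy document-topic matrix: 2 documents, 4 topics.
document_topic_matrix = np.array([
    [0.2, 0.1, 0.2, 0.5],  # best-matching topic: index 3
    [0.7, 0.1, 0.1, 0.1],  # best-matching topic: index 0
])
# np.argmax over each row gives the best-matching topic per document.
topic_ids = np.argmax(document_topic_matrix, axis=1)
s_topics = pd.Series(topic_ids, dtype="category")
print(s_topics.tolist())  # [3, 0]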

top_words_per_topic

top_words_per_topic(s_document_term: DocumentTermDF, s_clusters: CategorySeries, n_words=5) -> TokenSeries

The function takes as first input a DocumentTermDF (so output of tfidf, term_frequency, count) and as second input a CategorySeries (see #164) that assigns a topic/cluster to every document (so output of a clustering function or topics_from_topic_model).

The function uses the given clustering from the second input, which relates documents to topics. The first input relates documents to terms. From those two relations (documents->topics, documents->terms), the function calculates a distribution of documents to topics, and a distribution of topics to terms. These distributions are used to find the most relevant terms per topic through pyLDAvis again (see their original paper on how they find relevant terms).
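For reference, the relevance measure from that paper (Sievert & Shirley, 2014) can be sketched like this (our own toy implementation for illustration, not pyLDAvis code; the paper suggests λ ≈ 0.6):

import numpy as np

def relevance(topic_term_dists, term_frequency, lam=0.6):
    # relevance(term w, topic t | lambda) =
    #     lambda * log p(w|t) + (1 - lambda) * log( p(w|t) / p(w) )
    # lam=1 ranks purely by p(w|t); smaller lam boosts topic-specific terms.
    p_w = term_frequency / term_frequency.sum()  # marginal term distribution p(w)
    log_phi = np.log(topic_term_dists)           # log p(w|t), one row per topic
    return lam * log_phi + (1 - lam) * (log_phi - np.log(p_w))

# Toy example: 2 topics over 3 terms.
topic_term_dists = np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.3, 0.6]])
term_frequency = np.array([10, 5, 10])
rel = relevance(topic_term_dists, term_frequency)
print(np.argsort(-rel, axis=1))  # term indices per topic, most relevant first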

top_words_per_document

top_words_per_document(s_document_term: DocumentTermDF, n_words=5) -> TokenSeries

Very similar to top_words_per_topic, except that every document is treated as its own topic/cluster, so pyLDAvis finds relevant words ("keywords") that are characteristic of each document.

Showcase / Example

See this notebook for examples for this PR

mk2510 avatar Aug 25 '20 11:08 mk2510

As discussed, we'll do some more work on this

henrifroese avatar Aug 29 '20 16:08 henrifroese

@jbesomi We implemented the suggested changes. The big functions visualize_topics and top_words_per_document are now split up into multiple smaller ones, so users of the library now write the pipeline of functions themselves. The new functions include:

topic_matrices

def topic_matrices(s_document_term: pd.DataFrame, s_document_topic: pd.Series):

Get a DocumentTopic matrix and a TopicTerm matrix (both as DataFrames) from a DocumentTerm matrix and a document-topic Series. The two inputs (the first relating documents to terms, the second relating documents to topics) are used to generate a DocumentTopic matrix (relating documents to topics) and a TopicTerm matrix (relating topics to terms).
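A sketch of the underlying matrix algebra for the clustering case (toy data and our own illustration, assuming one cluster ID per document as the document-topic input):

import pandas as pd

# Toy DocumentTerm matrix: 3 documents, 4 terms.
document_term = pd.DataFrame(
    [[2, 1, 0, 0], [0, 1, 2, 0], [1, 0, 1, 2]],
    columns=["football", "great", "python", "code"])
# One cluster/topic ID per document.
s_document_topic = pd.Series([0, 1, 1], dtype="category")

# One-hot encode the IDs -> DocumentTopic matrix (documents x topics).
document_topic = pd.get_dummies(s_document_topic).astype(int)
# TopicTerm matrix (topics x terms): sum the term counts of each topic's documents.
topic_term = document_topic.T @ document_term
print(topic_term)
#    football  great  python  code
# 0         2      1       0     0
# 1         1      1       3     2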

relevant_words_per_topic

def relevant_words_per_topic(s_document_term, s_document_topic_distribution, s_topic_term_distribution, n_words=10, return_figure=False):

Use LDAvis to find the most relevant words for each topic. This function uses the three given relations (documents->terms, documents->topics, topics->terms) to find and return the most relevant words for each topic. The pyLDAvis library is used to find relevant words.

We left the two functions visualize_topics and relevant_words_per_document inside the library, as we think there will be a lot of users, especially ML specialists, who just want a quick glimpse of their data without understanding the whole algorithm. However, in the docstrings we mention that those functions are just pipeline wrappers, like the clean function in preprocessing.

For relevant_words_per_document the pipeline could look like:

>>> # New Series where every document is its own cluster.
>>> s_cluster = pd.Series(
...    np.arange(len(s_document_term)), index=s_document_term.index, dtype="category")  
>>> s_document_topic, s_topic_term = hero.topic_matrices(s_document_term, s_cluster) 
>>> s_document_topic_distribution = hero.normalize(s_document_topic, norm="l1")
>>> s_topic_term_distribution = hero.normalize(s_topic_term, norm="l1")  
>>> hero.relevant_words_per_topic(
...     s_document_term,
...     s_document_topic_distribution,
...     s_topic_term_distribution)  # doctest: +SKIP

and for visualize_topics a suggested hero-pipeline would look like this:

>>> import pyLDAvis  # doctest: +SKIP
>>> s_document_topic, s_topic_term = hero.topic_matrices(s_document_term, s_document_topic) 
>>> s_document_topic_distribution = hero.normalize(s_document_topic, norm="l1") 
>>> s_topic_term_distribution = hero.normalize(s_topic_term, norm="l1") 
>>> figure = hero.relevant_words_per_topic(s_document_term, s_document_topic_distribution, s_topic_term_distribution, return_figure=True) 
>>> pyLDAvis.show(figure)

The setup file and the travis.yml file were edited in the same way as in PR #171 to pin black to version 19, in order to have all doctests passing 🏁

mk2510 avatar Aug 30 '20 16:08 mk2510

Looks great! 🎉 🎉 👍

Can you please provide a short Google Colab notebook that shows a working pipeline with all the different functions integrated?

Do these functions also work on large datasets? Up to which size?

jbesomi avatar Sep 01 '20 09:09 jbesomi

@jbesomi we created a short notebook where we display the functionality of those two pipelines. 🐰 We think those two will be the main use cases of the implemented functions. The third 🥉 use case, finding relevant words per topic and including them in a data frame, is just a variation of the second pipeline, but with the documents clustered by an algorithm like kmeans or assigned to a topic with LSA/LDA.

Once those functions are ready to merge, we will prepare an exhaustive tutorial introducing users to Topic Modeling 💯

mk2510 avatar Sep 05 '20 20:09 mk2510

For now, reviewed only lda, see comments below

jbesomi avatar Sep 08 '20 11:09 jbesomi

Thanks for the review! As I commented above, we'll have to go through this again anyway once #156 is merged :pray: .

henrifroese avatar Sep 12 '20 09:09 henrifroese

~#156 has been merged; can you please go through it again?~ => let's wait for #157 to be merged

jbesomi avatar Sep 14 '20 15:09 jbesomi

We have also updated this branch, so it is now based on master 🥳 It is now ready to be reviewed or merged 🦀 🤞

mk2510 avatar Sep 22 '20 10:09 mk2510

This would be a very useful feature. Any pending blockers or any expected date for merging and releasing?

kepler avatar Apr 01 '21 11:04 kepler

Hey @kepler Yes, the plan is to merge this PR soon. But first, the idea is to release a new version with the HeroSeries (to introduce and explain the concept). After that, we will be able to merge this one.

The remaining steps for the HeroSeries are:

  1. make sure each function correctly makes use of the HeroSeries and test it (TODO: we need to open an issue)
  2. finalize the documentation for the HeroSeries (#117, #118, #135)

jbesomi avatar Apr 04 '21 14:04 jbesomi

Hello, do you have any news on this topic or when it will be released? Thanks :)

bcornet1 avatar Aug 13 '21 08:08 bcornet1

Hi, is there any news on when this PR will be implemented?

havardl avatar Mar 17 '22 21:03 havardl