There is a mismatch between `output["topic-word-matrix"]` and `dataset.get_vocabulary()` in terms of words?

Open Zay-Ben opened this issue 2 years ago • 5 comments

There is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words?

I created a DataFrame as follows:

df = pd.DataFrame(data = output["topic-word-matrix"], columns = dataset.get_vocabulary()).T

When I sort the DataFrame by a topic's column to get the top words for that topic, why do the results differ from output["topics"][i]?
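The comparison described above can be sketched without any library, as a rough illustration only. The helper names `top_words` and `mismatched_topics` and the toy numbers are made up for this example; the only assumptions about the real objects are that `output["topic-word-matrix"]` has one row per topic and one column per word, and that `output["topics"][i]` is a list of words.

```python
def top_words(topic_row, vocabulary, k=5):
    """Return the k words with the highest weight in one topic row."""
    if len(topic_row) != len(vocabulary):
        raise ValueError("row length and vocabulary size differ")
    # Rank column indices by descending weight, then map indices to words.
    ranked = sorted(range(len(topic_row)), key=lambda j: topic_row[j], reverse=True)
    return [vocabulary[j] for j in ranked[:k]]


def mismatched_topics(topic_word_matrix, vocabulary, reported_topics):
    """List the topic indices whose matrix-derived top words differ from the reported ones."""
    bad = []
    for i, (row, reported) in enumerate(zip(topic_word_matrix, reported_topics)):
        if top_words(row, vocabulary, k=len(reported)) != list(reported):
            bad.append(i)
    return bad


# Toy data (invented for illustration): 2 topics over a 4-word vocabulary.
matrix = [[0.1, 0.6, 0.2, 0.1],
          [0.5, 0.1, 0.1, 0.3]]
column_order = ["signal", "bill", "refund", "network"]  # order the columns really use
alphabetical = sorted(column_order)                      # order assumed when labelling
reported = [["bill", "refund"], ["signal", "network"]]

matching = mismatched_topics(matrix, column_order, reported)
shifted = mismatched_topics(matrix, alphabetical, reported)
print(matching, shifted)
```

With the labels in the same order as the columns, no topic is flagged; with the same words in a different order, every topic is. A discrepancy like the one in this issue is therefore what one would expect if the vocabulary list and the matrix columns are ordered differently.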

Thank you!

Zay-Ben • Jan 10 '23 15:01

There should be a one-to-one correspondence between the two. It's difficult to say what is wrong. Can you share more details about the problem?

silviatti • Feb 02 '23 09:02

Good day Dr. Silvia, nice to see you again, and thank you for your reply. Here are the details of the issue. :)

First, I created a dataset folder containing two files, namely corpus.txt and vocabulary.tsv, as OCTIS requires.

The corpus file:

image

The vocabulary file (sorted alphabetically):

image

Second, I loaded the dataset and trained LDA models with the dataset.

image

image

image

Third, after training, I loaded one of the LDA models and built a data frame with the model’s topic-word-matrix as the data and the dataset’s vocabulary as the columns. The resulting data frame is shown in the figure below:

image

Last, the top 5 words of the data frame’s first topic are different from the top 5 words of the model’s first topic.

image

I can't determine why there are discrepancies in the top words of the topics.

With appreciation,

Benz

Zay-Ben • Feb 02 '23 12:02

Hi Benz, sorry for the late reply. I haven't had time to work on OCTIS these past months. There's something weird, I agree. I would suggest two experiments in case you're still interested in this issue:

  • Can you also print out dataset.get_vocabulary()? Just to see if the vocabulary matches your file.
  • Could you try to repeat the experiment with another model and see if you have the same problem? I'd like to see if the problem is specific to LDA or general.
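The first check can go a little further than printing the list. The sketch below is only an illustration: `compare_vocabularies` is a hypothetical helper, and the two word lists are toy data standing in for the lines of the vocabulary file and the result of `dataset.get_vocabulary()`.

```python
def compare_vocabularies(file_words, loaded_words):
    """Summarise how two word lists differ in content and in order."""
    file_set, loaded_set = set(file_words), set(loaded_words)
    # Index of the first position where the two lists disagree.
    # None means no disagreement within the shorter list's length.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(file_words, loaded_words)) if a != b),
        None,
    )
    return {
        "same_words": file_set == loaded_set,
        "same_order": list(file_words) == list(loaded_words),
        "only_in_file": sorted(file_set - loaded_set),
        "only_in_loaded": sorted(loaded_set - file_set),
        "first_differing_index": first_diff,
    }


# Toy stand-ins: in practice file_words would be read from the vocabulary
# file and loaded_words would be dataset.get_vocabulary().
file_words = ["bill", "network", "refund", "signal"]
loaded_words = ["signal", "bill", "refund", "network"]

report = compare_vocabularies(file_words, loaded_words)
print(report)
```

Separating "same words" from "same order" helps tell a vocabulary file that was not loaded at all (different words) from one that was loaded but reordered (same words, different order).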

Thanks for your patience.

Silvia

silviatti • Apr 15 '23 13:04

Dear Dr. Silvia,

Thank you for taking the time to address my questions.

Regarding the first question, the results show that the order of the vocabulary differs before and after importing it with OCTIS. The vocabulary was sorted alphabetically before importing and appears to be shuffled after importing, as shown in the images of the first five terms of each vocabulary. image image

Regarding the second question, I trained two models (ETM and NMF) using the same dataset and found that the problem persists for NMF, but not for ETM, as shown in the figure below. I noticed that OCTIS's LDA and NMF are both from Gensim. Could this be the source of the error?

ETM: image image

NMF: image image
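The thread does not establish the cause, but if the Gensim-backed models order their matrix columns by an internal word-id mapping (an assumption, not something confirmed here), one workaround would be to reorder the columns by word before labelling them. The sketch below uses a made-up `id2word` dictionary and a hypothetical helper `realign_columns`; how to obtain the real column order from a trained OCTIS model is not shown in this issue.

```python
def realign_columns(matrix, column_words, target_words):
    """Reorder every row so that column j corresponds to target_words[j]."""
    position = {word: j for j, word in enumerate(column_words)}
    missing = [word for word in target_words if word not in position]
    if missing:
        raise KeyError(f"words absent from the matrix columns: {missing}")
    return [[row[position[word]] for word in target_words] for row in matrix]


# Hypothetical id -> word mapping describing the order the columns really use.
id2word = {0: "signal", 1: "bill", 2: "refund", 3: "network"}
column_words = [id2word[i] for i in range(len(id2word))]

# Toy topic-word matrix: 2 topics, columns in id2word order.
matrix = [[0.1, 0.6, 0.2, 0.1],
          [0.5, 0.1, 0.1, 0.3]]

# Target order: the alphabetically sorted vocabulary.
target = sorted(column_words)
aligned = realign_columns(matrix, column_words, target)
print(target)
print(aligned)
```

After realignment, labelling the columns with the sorted vocabulary gives each word its own weights, so sorting a topic by weight should again agree with the model's reported top words.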

Just to give context, the dataset consists of tweets that contain customer complaints about telecommunication companies.

Thank you again for your help! Topic modeling has never been easy without OCTIS. 😭

Zay-Ben • Apr 15 '23 15:04

Hi, just to double-check: when you load the custom dataset, do you have a file in the dataset folder called vocabulary.txt? That should be the vocabulary file where the words are sorted alphabetically. I ask because I noticed that your file is called "words.txt", so it is possible that OCTIS doesn't load it.

Let me know :)

silviatti • May 03 '23 07:05