
query_topics unexpected results

Open chrisfleisch opened this issue 3 years ago • 5 comments

When I get the topic for document 0 after building a model from 6000 documents like this:

model.get_documents_topics(model.document_ids[0:1], reduced=False)[0]

It will return 26 for the topic.

When I run query_topics on the same text as document 0 like this:

_, _, scores, topics = model.query_topics(text, 5, reduced=False)

I get this result:

(array([0.22279967, 0.18875876, 0.18315801, 0.17984378, 0.16473249],
       dtype=float32),
 array([56, 74, 54, 14, 26]))

Running query_topics again will give this result:

(array([0.20202185, 0.1844944 , 0.18084005, 0.17778644, 0.17015621],
       dtype=float32),
 array([44, 56, 54, 74, 26]))

I was expecting that running query_topics on the same text the model was built with would return the same topic as the highest-scoring prediction, and that the results would not change between calls. Am I doing something wrong, or is this the correct behavior?

chrisfleisch avatar Feb 25 '22 18:02 chrisfleisch

I just tried and was unable to replicate this behaviour. Can you tell me which embedding_model you used and share any more information?

ddangelov avatar Apr 03 '22 22:04 ddangelov

I'm using the default 'doc2vec' embedding_model with speed set to 'deep-learn'. Top2Vec version is 1.0.26. Python 3.9.10.

Here's an example

import sys
import pandas as pd
import multiprocessing

from pathlib import Path
from top2vec import Top2Vec  # top2vec 1.0.26
from tqdm.contrib.concurrent import process_map


print(sys.version)


dfs = []

for year in range(2008, 2019):
    file_name = f'{year}-projects.xls'
    data_url = f'https://iacc.hhs.gov/funding/data/{file_name}'
    df = pd.read_excel(data_url)

    if year == 2018:  # year is an int from range(), so compare against an int
        # fix the FISCAL YEAR and PROJECT NUMBER cols
        df.rename(
            columns={
                'FISCAL YEAR': 'PROJECT NUMBER',
                'PROJECT NUMBER':'FISCAL YEAR'
            },
            inplace=True
        )
    dfs.append(df)

df = pd.concat(dfs, axis=0)


print(df.shape)

df['raw_topic_text'] = df['PROJECT TITLE'] + ' ' + df['PROJECT DESCRIPTION']

print(f'Duplicated: {df.duplicated("raw_topic_text").sum()}')

topic_df = df.drop_duplicates('raw_topic_text')[['raw_topic_text']].dropna()
topic_df.reset_index(inplace=True, drop=True)

print(topic_df.shape)

def build_model(docs: list, model_filename: str, use_cache: bool = True) -> Top2Vec:
    if not Path(model_filename).exists() or not use_cache:
        model = Top2Vec(documents=docs, speed="deep-learn", workers=multiprocessing.cpu_count())
        model.save(model_filename)
    return Top2Vec.load(model_filename)


model_filename = '/data/models/top2vec_deep-learn-preprocess-' + 'iacc.hhs.gov-us-test'
print(model_filename)
model = build_model(topic_df['raw_topic_text'].tolist(), model_filename, use_cache=False)


print(f'N topics: {model.get_num_topics()}')


model.hierarchical_topic_reduction(20)


topic_df['topic'] = model.get_documents_topics(model.document_ids, reduced=True)[0]


text = topic_df.iloc[2]['raw_topic_text']
for i in range(20):
    _, _, topic_scores, topic_nums = model.query_topics(text, model.get_num_topics(reduced=True), reduced=True)
    # topic numbers change order on repeated calls
    print(topic_nums[:10])



text = topic_df.iloc[2]['raw_topic_text']
for i in range(20):
    _, _, topic_scores, topic_nums = model.query_topics(text, model.get_num_topics(reduced=True), reduced=True)
    # topic scores are different on each call
    print(topic_scores.tolist()[:5])


def query_topics(text: str) -> str:
    _, _, _, topic_nums = model.query_topics(text, model.get_num_topics(reduced=True), reduced=True)
    return topic_nums[0]

results_ls = process_map(query_topics, topic_df['raw_topic_text'].tolist(), chunksize=1)


# topics from get_documents_topics are sometimes different from query_topics
print(f'N different topics: {sum(topic_df["topic"] != results_ls)}') # expecting 0
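

A rough way to check whether the variation comes from the query embedding itself rather than from the topic assignment (this assumes the gensim Doc2Vec model is exposed as model.model, which may differ between Top2Vec versions): gensim's infer_vector is stochastic, so repeated inferences of the same text usually produce slightly different vectors.

from gensim.utils import simple_preprocess

tokens = simple_preprocess(text)  # rough stand-in for Top2Vec's internal tokenizer
v1 = model.model.infer_vector(tokens)
v2 = model.model.infer_vector(tokens)
print((v1 == v2).all())  # typically False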

chrisfleisch avatar Apr 04 '22 19:04 chrisfleisch

Perhaps try with the new version, 1.0.27. Without knowing your specific dataset it's hard to debug. If you could recreate this with a known dataset like 20newsgroups, I can help further.
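
If it helps, something along these lines would be a minimal reproduction (illustrative parameters, using sklearn's 20newsgroups loader):

from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
model = Top2Vec(documents=newsgroups.data, speed='deep-learn', workers=8)

text = newsgroups.data[0]
for _ in range(5):
    # check whether the topic ordering and scores change between calls
    _, _, topic_scores, topic_nums = model.query_topics(text, 5, reduced=False)
    print(topic_nums, topic_scores)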

ddangelov avatar Apr 04 '22 20:04 ddangelov

The data set is in the example.

chrisfleisch avatar Apr 04 '22 20:04 chrisfleisch

Oh ok, sorry, I didn't catch the link. I will have a look.

ddangelov avatar Apr 04 '22 20:04 ddangelov