query_topics unexpected results
When I get the topic for document 0 after building a model from 6000 documents like this:
model.get_documents_topics(model.document_ids[0:1], reduced=False)[0]
It returns 26 as the topic.
When I run query_topics on the same text as document 0 like this:
_, _, scores, topics = model.query_topics(text, 5, reduced=False)
I get this result:
(array([0.22279967, 0.18875876, 0.18315801, 0.17984378, 0.16473249],
dtype=float32),
array([56, 74, 54, 14, 26]))
Running query_topics again will give this result:
(array([0.20202185, 0.1844944 , 0.18084005, 0.17778644, 0.17015621],
dtype=float32),
array([44, 56, 54, 74, 26]))
I was expecting that running query_topics on the same text the model was built from would return the same topic as the highest-scoring prediction, and that the results would not change between calls. Am I doing something wrong, or is this the correct behavior?
I just tried and was unable to replicate this behaviour. Can you tell me which embedding_model you use, and any other details?
I'm using the default 'doc2vec' embedding_model with speed set to 'deep-learn'. Top2Vec version is 1.0.26. Python 3.9.10.
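If the doc2vec backend wraps gensim's Doc2Vec (which I believe it does), query_topics has to infer a fresh vector for the query text on each call, and gensim's infer_vector starts from a random initialization, so repeated calls produce slightly different vectors. A minimal sketch of that drift in plain gensim (the tiny corpus here is just a stand-in):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
corpus = ['autism research funding project', 'brain imaging study of infants', 'genetic risk factor analysis']
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]
d2v = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=40)
tokens = corpus[0].split()
v1 = d2v.infer_vector(tokens)
v2 = d2v.infer_vector(tokens)
# inference starts from a randomly initialized vector, so v1 and v2
# differ slightly, and topics with close similarities can swap rank
print((v1 == v2).all())  # almost certainly False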
Here's an example:
import sys
import pandas as pd
import multiprocessing
from pathlib import Path
from top2vec import Top2Vec # top2vec 1.0.26
from tqdm.contrib.concurrent import process_map
print(sys.version)
dfs = []
for year in range(2008, 2019):
    file_name = f'{year}-projects.xls'
    data_url = f'https://iacc.hhs.gov/funding/data/{file_name}'
    df = pd.read_excel(data_url)
    if year == 2018:  # year is an int, so compare against an int, not '2018'
        # fix the swapped FISCAL YEAR and PROJECT NUMBER cols
        df.rename(
            columns={
                'FISCAL YEAR': 'PROJECT NUMBER',
                'PROJECT NUMBER': 'FISCAL YEAR'
            },
            inplace=True
        )
    dfs.append(df)
df = pd.concat(dfs, axis=0)
print(df.shape)
df['raw_topic_text'] = df['PROJECT TITLE'] + ' ' + df['PROJECT DESCRIPTION']
print(f'Duplicated: {df.duplicated("raw_topic_text").sum()}')
topic_df = df.drop_duplicates('raw_topic_text')[['raw_topic_text']].dropna()
topic_df.reset_index(inplace=True, drop=True)
print(topic_df.shape)
def build_model(docs: list, model_filename: str, use_cache: bool = True) -> Top2Vec:
    if not Path(model_filename).exists() or not use_cache:
        # train and cache the model; deep-learn is the slowest, most thorough speed setting
        model = Top2Vec(documents=docs, speed="deep-learn", workers=multiprocessing.cpu_count())
        model.save(model_filename)
    return Top2Vec.load(model_filename)
model_filename = '/data/models/top2vec_deep-learn-preprocess-' + 'iacc.hhs.gov-us-test'
print(model_filename)
model = build_model(topic_df['raw_topic_text'].tolist(), model_filename, use_cache=False)
print(f'N topics: {model.get_num_topics()}')
model.hierarchical_topic_reduction(20)
topic_df['topic'] = model.get_documents_topics(model.document_ids, reduced=True)[0]
text = topic_df.iloc[2]['raw_topic_text']
for i in range(20):
    _, _, topic_scores, topic_nums = model.query_topics(text, model.get_num_topics(reduced=True), reduced=True)
    # topic numbers change order on repeated calls
    print(topic_nums[:10])
text = topic_df.iloc[2]['raw_topic_text']
for i in range(20):
    _, _, topic_scores, topic_nums = model.query_topics(text, model.get_num_topics(reduced=True), reduced=True)
    # topic scores are different on each call
    print(topic_scores.tolist()[:5])
def query_topics(text: str) -> int:  # returns a topic number, not a string
    _, _, _, topic_nums = model.query_topics(text, model.get_num_topics(reduced=True), reduced=True)
    return topic_nums[0]
results_ls = process_map(query_topics, topic_df['raw_topic_text'].tolist(), chunksize=1)
# topic from get_documents_topics are sometimes different than query_topics
print(f'N different topics: {sum(topic_df["topic"] != results_ls)}') # expecting 0
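My guess at what's happening: get_documents_topics works from the document vectors learned during training, which are fixed, while query_topics re-infers a vector on every call, which would explain both the run-to-run drift and the occasional disagreement with the stored topic. If that's right, averaging several inferences damps the noise somewhat. A rough sketch (a hypothetical helper, not a Top2Vec API; it assumes the doc2vec backend exposes its underlying gensim Doc2Vec as model.model, which may differ by version):
import numpy as np
from gensim.utils import simple_preprocess

def averaged_query_vector(top2vec_model, text: str, samples: int = 10, epochs: int = 100) -> np.ndarray:
    # damp infer_vector's randomness by averaging repeated inferences
    tokens = simple_preprocess(text)
    d2v = top2vec_model.model  # assumption: the doc2vec backend stores its gensim model here
    vecs = [d2v.infer_vector(tokens, epochs=epochs) for _ in range(samples)]
    return np.mean(vecs, axis=0)
This only shrinks the variance rather than removing it, since each individual inference is still stochastic.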
Perhaps try the new version, 1.0.27. Without knowing your specific dataset it's hard to debug. If you could recreate this with a known dataset like 20newsgroups, I can help further.
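Something along these lines would work as a starting point (a sketch, not tested end to end; fetch_20newsgroups is sklearn's standard loader, and I've used the faster 'learn' speed so it trains quickly):
from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
model = Top2Vec(documents=newsgroups.data, speed='learn', workers=4)
text = newsgroups.data[0]
for _ in range(5):
    _, _, topic_scores, topic_nums = model.query_topics(text, 5)
    print(topic_nums)  # see whether the ranking drifts between calls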
The dataset is in the example; the script downloads it from iacc.hhs.gov at the top.
Oh ok, sorry, I didn't catch the link. I will try it and have a look.