Parameter shardsize ignored on queries

Open MaickelHubner opened this issue 2 years ago • 0 comments

Problem description

When I pass the shardsize parameter to the similarities.Similarity constructor, the same parameter does not appear to be used when the index is queried, which causes errors:

self._similarity_index = similarities.Similarity(MODELS_PATH + f'/{model}', sim_vectors, num_features=len(self._dictionary), shardsize=50000)

sims = self._similarity_index[doc_vector]

[image]

PS: If I omit the shardsize parameter, the error already occurs in the similarities.Similarity call itself.
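For readers unfamiliar with sharded indexes, the following is a minimal pure-Python sketch of the general idea: the shard size is fixed when the index is built, and a query simply walks every stored shard without taking a shard size of its own. `ShardedIndex` and `cosine` are hypothetical names invented for this illustration. This is not gensim's implementation and it does not reproduce or diagnose the error reported above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ShardedIndex:
    """Toy stand-in for a sharded similarity index (hypothetical, not gensim code)."""

    def __init__(self, vectors, num_features, shardsize):
        # Every indexed vector must have exactly num_features entries.
        for v in vectors:
            if len(v) != num_features:
                raise ValueError("vector length does not match num_features")
        self.num_features = num_features
        # shardsize is consumed here, at build time, to split the corpus.
        self.shards = [
            vectors[i:i + shardsize] for i in range(0, len(vectors), shardsize)
        ]

    def __getitem__(self, query):
        # The query takes no shardsize: it visits each shard in order
        # and concatenates the per-shard similarities.
        if len(query) != self.num_features:
            raise ValueError("query length does not match num_features")
        return [cosine(query, v) for shard in self.shards for v in shard]


index = ShardedIndex([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], num_features=2, shardsize=2)
sims = index[[1.0, 0.0]]
print(len(index.shards), sims)
```

In this toy version the three vectors end up in two shards, and the query returns one similarity per indexed vector regardless of how the shards were cut.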

Steps/code/corpus to reproduce

Save the .py files in the pruvo folder (the package) and the .parquet file in the data folder, then run this script:

import pandas as pd

from pruvo.embedding import Corpus

df = pd.read_parquet('data/preprocess.parquet')

corpus = Corpus()
corpus.add(list(df['bookingRoomType'].unique()), pre_processed=True)
corpus.add(list(df['mappedRoomType'].unique()), pre_processed=True)

w2v = corpus.train(model='word2vec')

w2v_similars = corpus.get_similars('apartment 1 king bed in neverland')
w2v_similars.head(10)

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import struct; print("Bits", 8 * struct.calcsize("P"))
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)

[image]

files.zip

MaickelHubner, Nov 30 '22 10:11