gensim
Parameter shardsize ignored on queries
Problem description
When I pass the shardsize parameter to similarities.Similarity, the same parameter is not used when querying the index, which causes errors:
self._similarity_index = similarities.Similarity(MODELS_PATH + f'/{model}', sim_vectors, num_features=len(self._dictionary), shardsize=50000)
sims = self._similarity_index[doc_vector]
PS: If I don't use the shardsize parameter, the error already occurs in the similarities.Similarity call.
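For readers unfamiliar with the parameter: shardsize is the maximum number of documents stored per index shard, and a query is expected to return one score per indexed document however the corpus was split. The following is a minimal pure-Python sketch of that idea only. ToyShardedIndex and cosine are hypothetical names made up for illustration, and this is not gensim's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {feature_id: weight} dicts."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyShardedIndex:
    """Hypothetical stand-in for a sharded index (assumption, not gensim code)."""

    def __init__(self, vectors, shardsize):
        # Split the corpus into consecutive shards of at most `shardsize` documents.
        self.shards = [
            vectors[i:i + shardsize] for i in range(0, len(vectors), shardsize)
        ]

    def __getitem__(self, query):
        # Query each shard in turn and concatenate the per-shard scores,
        # so the result length equals the number of indexed documents.
        sims = []
        for shard in self.shards:
            sims.extend(cosine(query, doc) for doc in shard)
        return sims


docs = [{0: 1.0}, {1: 1.0}, {0: 1.0, 1: 1.0}, {2: 1.0}, {0: 2.0}]
index = ToyShardedIndex(docs, shardsize=2)
sims = index[{0: 1.0}]
print(len(index.shards), len(sims))
```

In this sketch the shard size only affects how documents are grouped, not the query result, which is the behaviour I expected from the real index as well.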
Steps/code/corpus to reproduce
Save the .py files in the pruvo folder (package) and the .parquet file in the data folder, then run this script:
import pandas as pd
from pruvo.embedding import Corpus
df = pd.read_parquet('data/preprocess.parquet')
corpus = Corpus()
corpus.add(list(df['bookingRoomType'].unique()), pre_processed=True)
corpus.add(list(df['mappedRoomType'].unique()), pre_processed=True)
w2v = corpus.train(model='word2vec')
w2v_similars = corpus.get_similars('apartment 1 king bed in neverland')
w2v_similars.head(10)
Versions
Please provide the output of:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import struct; print("Bits", 8 * struct.calcsize("P"))
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)