sparse_dot_topn
Did you compare with gensim?
Did you compare with gensim, as discussed in https://stats.stackexchange.com/questions/61085/cosine-similarity-on-sparse-matrix? I am using gensim, which works pretty well, especially with text data, which is usually high-dimensional and sparse.
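Roughly, the gensim workflow I mean looks like this (a minimal sketch; the toy documents are placeholders, in practice they would be your tokenised texts):

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity

# Toy documents (placeholders for real tokenised texts).
docs = [
    ["sparse", "matrix", "cosine", "similarity"],
    ["dense", "vector", "product"],
    ["sparse", "vector", "similarity"],
]
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
tfidf = TfidfModel(bow)

# Sparse cosine-similarity index over the TF-IDF corpus.
index = SparseMatrixSimilarity(tfidf[bow], num_features=len(dictionary))
sims = index[tfidf[bow[0]]]  # cosine similarity of doc 0 against all docs
print(sims)
```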
Hi,
Could you tell us the size of your matrices, the number of non-empty elements in each, the number of top results, your number of CPUs, and the execution time using gensim? We can then easily compare.
Thank you in advance.
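To make the comparison concrete, a timing harness for the sparse_dot_topn side could look like the sketch below. It uses random sparse matrices rather than real data; the shapes, density, and ntop are placeholders to replace with your actual numbers:

```python
import time
import scipy.sparse as sp
from sparse_dot_topn import awesome_cossim_topn

# Placeholder sizes: replace with the shapes/density of your real matrices.
n_rows, n_cols, n_features, density, ntop = 10_000, 10_000, 50_000, 0.001, 10

A = sp.random(n_rows, n_features, density=density, format='csr')
B = sp.random(n_features, n_cols, density=density, format='csr')

t0 = time.perf_counter()
# Keeps only the ntop largest values per row of A @ B.
# For multi-core runs, pass use_threads=True and n_jobs=<number of CPUs>.
C = awesome_cossim_topn(A, B, ntop=ntop, lower_bound=0.0)
print(f"top-{ntop} sparse dot product: {time.perf_counter() - t0:.2f}s, "
      f"result nnz = {C.nnz}")
```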
Just an idea, since I do not have your data, sorry: maybe use textual data converted to one-hot / TF-IDF vectors?

```python
import sys
print('python version is', sys.version)
print('path to python exe', sys.executable)

import re
import sklearn
print('sklearn version', sklearn.__version__)

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = [
    "comp.graphics",
    "sci.space",
]

def load_dataset(verbose=False, remove=()):
    """Load and vectorize the 20 newsgroups dataset."""
    data_train = fetch_20newsgroups(
        subset="train",
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=remove,
    )
    data_test = fetch_20newsgroups(
        subset="test",
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=remove,
    )
    return (
        data_train.data,
        data_test.data,
        data_train.target,
        data_test.target,
        data_train.target_names,
    )

X_train, X_test, y_train, y_test, target_names = load_dataset(verbose=True)

# Write the labels, one per line.
with open('y_test.txt', 'w') as f:
    for line in y_test:
        f.write(f"{line}\n")

with open('y_train.txt', 'w') as f:
    for line in y_train:
        f.write(f"{line}\n")

# Write the documents, one per line, with special characters
# and embedded newlines stripped.
with open('X_test.txt', 'w') as f:
    for line in X_test:
        line = re.sub('[!@#$]', '', line)
        clean_line = line.replace('\r', '').replace('\n', '')
        f.write(f"{clean_line}\n")

with open('X_train.txt', 'w') as f:
    for line in X_train:
        line = re.sub('[!@#$]', '', line)
        clean_line = line.replace('\r', '').replace('\n', '')
        f.write(f"{clean_line}\n")
```
Then, for preprocessing:

```python
vectorizer = TfidfVectorizer(
    sublinear_tf=True, max_df=0.5, min_df=5, stop_words="english"
)
X_train_vectorised = vectorizer.fit_transform(X_train)
```
This way you have a lot of sparse rows in X_train_vectorised to compare with each other.
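To get the top matches per document with this library, something like the following sketch should work (assuming awesome_cossim_topn; TfidfVectorizer rows are already L2-normalised by default, so the dot product is the cosine similarity):

```python
from sparse_dot_topn import awesome_cossim_topn

# A @ A.T on L2-normalised rows is the cosine similarity; keep the
# 10 best matches per document above a 0.1 threshold (both placeholders).
C = awesome_cossim_topn(
    X_train_vectorised,
    X_train_vectorised.T.tocsr(),  # transpose of CSR is CSC, so convert back
    ntop=10,
    lower_bound=0.1,
)
print(C.shape, C.nnz)  # sparse matrix of top-10 similarities per document
```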
Or use CountVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) instead of TfidfVectorizer.
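For example (a sketch; raw counts are not L2-normalised, so normalise the rows first if the dot products should be cosine similarities):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

vectorizer = CountVectorizer(max_df=0.5, min_df=5, stop_words="english")
# L2-normalise the sparse count rows so row dot products are cosines.
X_train_vectorised = normalize(vectorizer.fit_transform(X_train))
```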