
did you compare with gensim

Sandy4321 opened this issue 3 years ago · 1 comment

Did you compare with gensim, as discussed in https://stats.stackexchange.com/questions/61085/cosine-similarity-on-sparse-matrix? I am using gensim, which works pretty well, especially with text data, which is usually high-dimensional and sparse.

Sandy4321 · Jul 01 '22 16:07

Hi,

Could you tell us the size of your matrices, the number of non-empty elements in both, the number of top results, the number of CPUs, and the execution time using gensim? We can then easily compare.
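For reference, here is a hedged sketch of how those numbers could be collected. The random matrices and the plain SciPy product are placeholders, not real data; sparse_dot_topn's top-n routine or gensim's similarity index would replace the multiplication line in an actual benchmark.

```python
import os
import time
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(42)
n_rows, n_cols, density = 10_000, 5_000, 0.001
ntop = 10  # number of top results to keep per row

# Placeholder matrices standing in for the real TF-IDF data
A = sp.random(n_rows, n_cols, density=density, format="csr", random_state=rng)
B = A.T.tocsr()

t0 = time.perf_counter()
C = A @ B  # full sparse product; sparse_dot_topn would prune to ntop per row
elapsed = time.perf_counter() - t0

print(f"A: {A.shape}, nnz={A.nnz}")
print(f"B: {B.shape}, nnz={B.nnz}")
print(f"ntop={ntop}, cpus={os.cpu_count()}, time={elapsed:.3f}s")
```

Reporting exactly these figures (shapes, nnz, ntop, CPU count, wall-clock time) for both libraries on the same inputs would make the comparison straightforward.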

Thank you in advance


stephanecollot · Jul 02 '22 07:07

Just an idea, I do not have the data, sorry. Maybe use textual data converted to one-hot?

```python
import sys
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import RidgeClassifier
import numpy as np

print('python version is', sys.version)
print('path to python exe', sys.executable)
print('sklearn version', sklearn.__version__)
```

```python
import re
from sklearn.datasets import fetch_20newsgroups

categories = [
    "comp.graphics",
    "sci.space",
]

def load_dataset(verbose=False, remove=()):
    """Load and vectorize the 20 newsgroups dataset."""
    data_train = fetch_20newsgroups(
        subset="train", categories=categories, shuffle=True,
        random_state=42, remove=remove,
    )
    data_test = fetch_20newsgroups(
        subset="test", categories=categories, shuffle=True,
        random_state=42, remove=remove,
    )
    return (data_train.data, data_test.data,
            data_train.target, data_test.target, data_train.target_names)

X_train, X_test, y_train, y_test, target_names = load_dataset(verbose=True)

with open('y_test.txt', 'w') as f:
    for line in y_test:
        f.write(f"{line}\n")
with open('y_train.txt', 'w') as f:
    for line in y_train:
        f.write(f"{line}\n")
```

with open('X_test.txt', 'w') as f:
    for line in X_test:
        line = re.sub('[!@#$]', '', line)            
        clean_line =  line.replace('\r', '').replace('\n', '')
        f.write(f"{clean_line}\n")    
with open('X_train.txt', 'w') as f:
    for line in X_train:
        line = re.sub('[!@#$]', '', line)            
        clean_line =  line.replace('\r', '').replace('\n', '')
        f.write(f"{clean_line}\n")

Then:

```python
# preprocessing
vectorizer = TfidfVectorizer(
    sublinear_tf=True, max_df=0.5, min_df=5, stop_words="english"
)
X_train_vectorised = vectorizer.fit_transform(X_train)
```

This way you have a lot of sparse rows in X_train_vectorised to compare with each other.

Or use https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html instead of TfidfVectorizer.
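The comparison sketched above can also be run end to end on a tiny in-memory corpus, with no 20-newsgroups download. This is only a sketch: the corpus is made up, and the dense argmax at the end stands in for the pruned top-n sparse product that sparse_dot_topn (or gensim's similarity index) would perform.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: two "graphics" documents and two "space" documents
corpus = [
    "graphics rendering with opengl",
    "rendering 3d graphics pipelines",
    "the space shuttle orbited earth",
    "nasa launched a probe into space",
]

# TfidfVectorizer L2-normalizes rows by default, so the dot
# product of two rows is directly their cosine similarity.
X = TfidfVectorizer().fit_transform(corpus)  # sparse, one row per document
sims = (X @ X.T).toarray()                   # pairwise cosine similarities
np.fill_diagonal(sims, -1)                   # ignore self-matches
best = sims.argmax(axis=1)                   # nearest neighbour per document
print(best)
```

The two graphics documents should pick each other as nearest neighbours, and likewise the two space documents; swapping in CountVectorizer only changes the weighting, not the shape of the computation.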

Sandy4321 · Aug 04 '23 01:08