sparse_dot_topn

Speed up populating pandas dataframe from ndarrays and csr matrix

Open cibic89 opened this issue 6 years ago • 2 comments

I have a suggestion if you are interested

cibic89 avatar Nov 07 '19 09:11 cibic89

@cibic89 thanks! Could you please elaborate a bit?

ymwdalex avatar Nov 13 '19 09:11 ymwdalex

@cibic89 thanks! Could you please elaborate it a bit?

You have this:

import numpy as np
import pandas as pd

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    # Never read past the number of stored matches, even if top is larger
    if top:
        nr_matches = min(top, sparsecols.size)
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    # One Python-level iteration per match; this is the slow part
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similarity': similarity})

Instead of populating the arrays in a for loop, you can index the names with the row and column arrays directly and assign the results as DataFrame columns:

matches = awesome_cossim_topn(tfidf_matrix, tfidf_matrix.transpose(), 100, 0, use_threads=True, n_jobs=4)

# Row and column indices of every stored match
non_zeros = matches.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]

matches_df = pd.DataFrame()
# Convert to category on the Series, before .values; a plain NumPy array has no "category" dtype
matches_df["left_side"] = name_series.iloc[sparserows].astype("category").values
matches_df["right_side"] = name_series.iloc[sparsecols].astype("category").values
matches_df["similarity"] = pd.to_numeric(matches.data, downcast="float")
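
To check that the vectorized version gives the same rows as the loop, here is a minimal self-contained sketch. It does not call sparse_dot_topn at all: the small hand-written `matches` matrix, the `names` list, and the helper names `matches_df_loop` and `matches_df_vectorized` are all made up for illustration, and the sketch assumes the match matrix is a SciPy CSR matrix.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def matches_df_loop(sparse_matrix, names):
    """Loop-based construction, mirroring get_matches_df without the top limit."""
    rows, cols = sparse_matrix.nonzero()
    n = cols.size
    left = np.empty(n, dtype=object)
    right = np.empty(n, dtype=object)
    sim = np.zeros(n)
    for i in range(n):
        left[i] = names[rows[i]]
        right[i] = names[cols[i]]
        sim[i] = sparse_matrix.data[i]
    return pd.DataFrame({"left_side": left, "right_side": right, "similarity": sim})


def matches_df_vectorized(sparse_matrix, names):
    """Vectorized construction: index the names with whole index arrays."""
    # COO format exposes row, col and data as three aligned arrays
    coo = sparse_matrix.tocoo()
    names = np.asarray(names, dtype=object)
    return pd.DataFrame({
        "left_side": names[coo.row],
        "right_side": names[coo.col],
        "similarity": coo.data,
    })


# Hypothetical toy data standing in for the output of awesome_cossim_topn
names = ["acme corp", "acme inc", "zenith ltd"]
matches = csr_matrix(np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]))

loop_df = matches_df_loop(matches, names)
vec_df = matches_df_vectorized(matches, names)
print(vec_df)
```

Both functions should produce the same five rows here. The vectorized one replaces the per-match Python iteration with two fancy-indexing operations, which is where the speedup on large match matrices would come from.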

cibic89 avatar Nov 13 '19 10:11 cibic89

Thanks @cibic89. Closing as now outdated

RUrlus avatar Jan 31 '24 15:01 RUrlus