sparse_dot_topn
Speed up populating pandas dataframe from ndarrays and csr matrix
I have a suggestion if you are interested
@cibic89 thanks! Could you please elaborate it a bit?
You have this:
def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})
Instead of filling the arrays in a for loop, you can index the name series with the row and column indices directly and assign the results as columns of a dataframe:
matches = awesome_cossim_topn(tfidf_matrix, tfidf_matrix.transpose(), 100, 0, use_threads=True, n_jobs=4)
non_zeros = matches.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]

matches_df = pd.DataFrame()
# Convert to category on the Series first; a plain NumPy array from .values does not accept "category"
matches_df["left_side"] = name_series.iloc[sparserows].astype("category").values
matches_df["right_side"] = name_series.iloc[sparsecols].astype("category").values
# Downcast the scores to the smallest float dtype that can hold them
matches_df["similarity"] = pd.to_numeric(matches.data, downcast="float")
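For anyone who wants to try the idea without the rest of the pipeline, here is a minimal self-contained sketch. The helper name matches_to_frame and the toy matrix are made up for illustration and are not part of sparse_dot_topn; the toy matrix only stands in for the output of awesome_cossim_topn. It reads the row, column and data arrays from the COO form of the matrix so that all three share one ordering, which is an assumption I prefer over mixing nonzero() with .data.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def matches_to_frame(matches, names, top=None):
    """Build a (left_side, right_side, similarity) frame without a Python loop.

    Hypothetical helper for illustration; not part of sparse_dot_topn.
    """
    coo = matches.tocoo()  # row, col and data arrays share one ordering
    rows, cols, data = coo.row, coo.col, coo.data
    if top is not None:
        # Slicing never runs past the end, even if top exceeds the match count
        rows, cols, data = rows[:top], cols[:top], data[:top]
    names = np.asarray(names, dtype=object)
    # Fancy indexing looks up every name in one vectorized step
    return pd.DataFrame({
        "left_side": names[rows],
        "right_side": names[cols],
        "similarity": data,
    })


# Toy 3x3 similarity matrix standing in for the awesome_cossim_topn output.
toy = csr_matrix(np.array([[1.0, 0.8, 0.0],
                           [0.8, 1.0, 0.0],
                           [0.0, 0.0, 1.0]]))
toy_names = pd.Series(["acme ltd", "acme limited", "zenith"])
frame = matches_to_frame(toy, toy_names)
```

The category conversion and float downcast from my snippet above can still be applied to the resulting columns if memory matters.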
Thanks @cibic89. Closing as now outdated