sparse_dot_topn

Are you planning to develop code for binary matrices (values only 0 or 1)?

Open Sandy4321 opened this issue 2 years ago • 8 comments

What about binary matrices? Are you planning to develop code for binary matrices (values only 0 or 1)?

Sandy4321 avatar Jul 01 '22 16:07 Sandy4321

It should already work if you cast them beforehand to np.float32 (for 32 bits).

We could implement support for bool types (1 bit), and maybe get a smaller memory footprint and a performance boost. Let us know your use case.
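As a rough sketch of the cast suggested above: the snippet below builds a small, made-up binary matrix with SciPy and converts its stored values to np.float32. The matrix contents and variable names are illustrative assumptions, and the call into sparse_dot_topn itself is deliberately left out because its exact signature is not given in this thread.

```python
import numpy as np
from scipy import sparse

# Hypothetical binary (0/1) data, e.g. a one-hot encoded matrix.
dense = np.array([[1, 0, 0, 1],
                  [0, 1, 0, 0],
                  [1, 1, 0, 0]], dtype=bool)
binary_csr = sparse.csr_matrix(dense)

# Cast the stored values to float32 before handing the matrix to the library.
as_float32 = binary_csr.astype(np.float32)

print(as_float32.dtype)  # float32
print(as_float32.nnz)    # 5 stored entries, all equal to 1.0
```

The cast only changes the dtype of the stored values; the sparsity pattern stays the same.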

stephanecollot avatar Jul 03 '22 18:07 stephanecollot

Great news, thank you so much. One-hot data is sparse data with values of only 0 or 1. The example you asked for is https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html; you only need to feed the count vectorizer output into a one-hot encoding.

The full example is here: https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e. The main lines are:

Or if we wanted to get the vector for one word:

print('Hot vector: ')
print(vectorizer.transform(['hot']).toarray())

Or there is a simple example of getting a one-hot encoding here: https://www.ritchieng.com/machinelearning-one-hot-encoding/

Thanks again. Could you do this as soon as possible, please?

Sandy4321 avatar Jul 06 '22 20:07 Sandy4321

Any update, please?

Sandy4321 avatar Jul 11 '22 21:07 Sandy4321

It works for binary matrices if you cast them; see my first message. If it doesn't, explain why and give the full details and size of your problem.

stephanecollot avatar Jul 26 '22 15:07 stephanecollot

Cast to what? To a binary type with only zeros and ones, i.e. one-bit values? My guess is that with the code as it is now, that cannot be done. As you wrote: "We could implement for bool types (1 bit), and maybe get a smaller memory footprint and performance boost."

Sandy4321 avatar Jul 31 '22 18:07 Sandy4321

@Sandy4321 bool types are not 1 bit but one byte, as that is the smallest addressable unit for CPUs. You can of course pack bits into other types, and there is vector<bool>, which may pack bools, but that is implementation dependent. It could save space, but I am not sure it would give a speedup.
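The same point can be illustrated on the Python side with NumPy, as a minimal sketch: a bool array uses one byte per element, and packing the bits by hand is what shrinks it. The array contents are made up, and this says nothing about how sparse_dot_topn stores values internally.

```python
import numpy as np

flags = np.array([1, 0, 1, 1, 0, 0, 0, 1, 1, 0], dtype=bool)

# Each bool element occupies one byte, not one bit.
print(flags.itemsize)  # 1
print(flags.nbytes)    # 10

# Packing stores eight flags per byte; the last byte is zero-padded.
packed = np.packbits(flags)
print(packed.nbytes)   # 2

# Unpacking needs the original length to drop the padding bits.
restored = np.unpackbits(packed, count=flags.size).astype(bool)
```

Packed bits save space, but every access then needs shifting and masking, which is one reason a speedup is not guaranteed.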

RUrlus avatar Aug 01 '22 06:08 RUrlus

Great. Then let's do at least byte-sized data? And of course sparse-format data. Huge RAM saving!
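How much byte-sized values would save can be estimated with SciPy alone. The sketch below compares the buffers of a CSR matrix stored as float32 and as int8; the matrix size, density, and the helper csr_nbytes are assumptions made for illustration, and whether sparse_dot_topn accepts int8 input is not established in this thread.

```python
import numpy as np
from scipy import sparse

def csr_nbytes(m):
    """Total bytes held by the three CSR buffers (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# Made-up binary data: 1000 x 1000 with roughly 1% ones.
rng = np.random.default_rng(0)
dense = rng.random((1000, 1000)) < 0.01

as_float32 = sparse.csr_matrix(dense).astype(np.float32)
as_int8 = as_float32.astype(np.int8)

# The value buffer shrinks by a factor of four ...
print(as_float32.data.nbytes, as_int8.data.nbytes)
# ... but the column indices and row pointers are unchanged.
print(as_float32.indices.nbytes, as_int8.indices.nbytes)
print(csr_nbytes(as_float32), csr_nbytes(as_int8))
```

Because CSR also stores one column index per nonzero, the total saving is smaller than the 4x reduction of the value buffer alone.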

Sandy4321 avatar Aug 01 '22 11:08 Sandy4321

Some ideas you can try with 8-bit numbers: https://arxiv.org/abs/2208.07339 and https://huggingface.co/blog/hf-bitsandbytes-integration

Sandy4321 avatar Aug 18 '22 19:08 Sandy4321

Closing due to inactivity

RUrlus avatar Jan 31 '24 15:01 RUrlus