Nominal and Ordinal input variables

Open cwk20 opened this issue 3 years ago • 3 comments

It would be very helpful to include an example in the UMAP documentation on how to handle nominal and ordinal input variables. I come across a number of input variables like gender, which can simply be one-hot encoded, but there are also ones like "When did you last take antibiotics?" where one can choose among, for example, 1 week ago, 2 weeks ago, 1 month ago, ... or none. There is an intrinsic order here. Do I simply one-hot encode them, in which case the information about the order is lost?
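To make the loss of order concrete, here is a minimal sketch in plain Python. The answer levels, their ordering (with "none" placed at the far end), and the helper names `ordinal_encode`, `one_hot_encode`, and `hamming` are all assumptions for illustration; none of this is part of UMAP.

```python
# Hypothetical answer levels, ordered from most recent to least recent.
# Placing "none" at the far end of the scale is an assumption.
LEVELS = ["1 week ago", "2 weeks ago", "1 month ago", "none"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def ordinal_encode(answer):
    """Map an answer to its rank, scaled to the interval [0, 1]."""
    return RANK[answer] / (len(LEVELS) - 1)


def one_hot_encode(answer):
    """Map an answer to a 0/1 indicator vector with one slot per level."""
    return [1 if level == answer else 0 for level in LEVELS]


def hamming(u, v):
    """Fraction of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


a, b, c = "1 week ago", "2 weeks ago", "none"

# One-hot + Hamming: every pair of distinct answers is equally far apart.
d_onehot_ab = hamming(one_hot_encode(a), one_hot_encode(b))
d_onehot_ac = hamming(one_hot_encode(a), one_hot_encode(c))

# Ordinal ranks: the distance grows with the gap between the answers.
d_ord_ab = abs(ordinal_encode(a) - ordinal_encode(b))
d_ord_ac = abs(ordinal_encode(a) - ordinal_encode(c))
```

With the one-hot encoding, "1 week ago" is exactly as far from "2 weeks ago" as it is from "none", whereas the rank encoding keeps adjacent answers closer together. The rank encoding does assume equal spacing between levels, which may not hold for every survey question.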

Can I input all of these together to UMAP using a relevant metric such as Hamming? I have left out the continuous input variables like height and weight, as I am not aware of any off-the-shelf method that can currently handle both continuous and categorical data together.

The examples in the UMAP documentation mostly use continuous input variables. An example with nominal and ordinal input variables would be very much appreciated. Thank you for your help.

cwk20 avatar Sep 02 '22 05:09 cwk20

Really UMAP is primarily good at dealing with continuous variables, or at the very least numeric vectors for which a sensible distance metric exists or can be defined. The problem of converting a dataset into numeric vectors with a sensible distance metric on them is somewhat out of scope in that it is a huge topic. If you would like to provide a good example to add to the documentation that would be appreciated, but I feel like there are entire packages (dirty-cat, vectorizers, etc.) devoted to this task, and I would rather leave the complex cases in their hands.

lmcinnes avatar Sep 02 '22 14:09 lmcinnes

Thank you for your comments. If something like the Gower metric could be included in UMAP, we would have a handle on mixed data types. In areas like health & medicine, we come across patient surveys with many questions that are categorical in nature (e.g., smoker / non-smoker, suffered from a certain disease type yes / no, etc.) together with standard questions like weight and height, which are real-valued. I would say there are more categorical variables than continuous ones. I guess such functionality (e.g., the Gower metric) would broaden the scope of UMAP and facilitate research in health & medicine. I am more than happy to provide an example if I can find one good enough to go into the documentation. Perhaps some readers can provide good examples if they know of any. But first, I need to get umap.plot to work, which I have raised as a previous issue (#906). This may have something to do with package versions. Thanks again for all your help.
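For readers unfamiliar with it, the Gower distance averages a per-feature dissimilarity: 0 or 1 for a categorical mismatch, and the absolute difference divided by the feature's range for a numeric feature. Below is a minimal pure-Python sketch of that idea, assuming equal feature weights and no missing values. The function names `gower_distance` and `gower_matrix` and the patient rows are hypothetical and not part of UMAP.

```python
def gower_distance(x, y, is_categorical, ranges):
    """Average per-feature dissimilarity between two rows (equal weights)."""
    total = 0.0
    for xi, yi, cat, r in zip(x, y, is_categorical, ranges):
        if cat:
            # Categorical feature: simple mismatch indicator.
            total += 0.0 if xi == yi else 1.0
        else:
            # Numeric feature: absolute difference scaled by the observed range.
            total += abs(xi - yi) / r if r > 0 else 0.0
    return total / len(x)


def gower_matrix(rows, is_categorical):
    """Pairwise Gower distances for a list of equal-length rows."""
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(None)  # unused for categorical features
        else:
            column = [row[j] for row in rows]
            ranges.append(max(column) - min(column))
    return [
        [gower_distance(a, b, is_categorical, ranges) for b in rows]
        for a in rows
    ]


# Hypothetical survey rows: (smoker, had_disease, weight_kg, height_cm)
rows = [
    ("yes", "no", 80.0, 180.0),
    ("no", "no", 60.0, 160.0),
    ("yes", "no", 70.0, 170.0),
]
is_categorical = [True, True, False, False]

D = gower_matrix(rows, is_categorical)
```

A matrix like `D`, converted to a NumPy array, could in principle be passed to `umap.UMAP(metric="precomputed")`; that step is left out here and has not been verified against this data. A full pairwise matrix also grows quadratically with the number of rows, so this is only practical for modest dataset sizes.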

cwk20 avatar Sep 05 '22 01:09 cwk20

There is an open-source clustering package for mixed-type data called DenseClus, which uses both UMAP and HDBSCAN: https://aws.amazon.com/ko/blogs/opensource/introducing-denseclus-an-open-source-clustering-package-for-mixed-type-data/

cwk20 avatar Sep 20 '22 01:09 cwk20