How to project numerical and categorical data?

Open RaulRomaniF opened this issue 7 years ago • 5 comments

I want to project the Titanic dataset (https://www.kaggle.com/c/titanic/data), which contains both categorical and numerical data.

I heard in this video (https://www.youtube.com/watch?v=YPJQydzTLwQ, at 48:08) that UMAP can combine multiple data types, so the question is: how?

One approach would be to project the numerical data only, then the categorical data only, and finally combine them in the same space. But is that approach the way to go?

Thank you for your time.

RaulRomaniF avatar Aug 05 '18 05:08 RaulRomaniF

This can be done in theory; in practice I am still working on the code to do this, so it isn't available in the repository yet. This may not be the answer you are looking for. As an interim step, you can check issue #58, which provides a simple recipe to do this in straightforward cases.

lmcinnes avatar Aug 05 '18 17:08 lmcinnes

Data with both categorical and numerical types can be handled using the Gower distance metric. You can download code for the Gower distance metric from here. It might be available in an upcoming scikit-learn release.
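To make the idea concrete, here is a minimal NumPy sketch of a Gower-style distance, assuming no missing values and equal feature weights. The function name `gower_matrix` and the toy Titanic-like rows are hypothetical and only for illustration; this is not the code linked above. Numeric features contribute a range-normalized absolute difference, categorical features contribute 0 on a match and 1 otherwise, and the result is the average over all features.

```python
import numpy as np


def gower_matrix(numeric, categorical):
    """Pairwise Gower-style distances for mixed data (hypothetical helper).

    Assumes no missing values and equal weight for every feature.
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical, dtype=object)

    # Numeric part: absolute difference scaled by each feature's range.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    num_diff = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges

    # Categorical part: 0 if the categories match, 1 otherwise.
    cat_diff = (categorical[:, None, :] != categorical[None, :, :]).astype(float)

    n_features = numeric.shape[1] + categorical.shape[1]
    return (num_diff.sum(axis=2) + cat_diff.sum(axis=2)) / n_features


# Toy rows: (age, fare) are numeric, (sex, embarked) are categorical.
numeric = [[22.0, 7.25], [38.0, 71.28], [26.0, 7.92]]
categorical = [["male", "S"], ["female", "C"], ["female", "S"]]
D = gower_matrix(numeric, categorical)

# With umap-learn installed, a precomputed distance matrix can be embedded:
# embedding = umap.UMAP(metric="precomputed").fit_transform(D)
```

The full pairwise matrix needs memory quadratic in the number of rows, so this form is only practical for modest dataset sizes.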

asdspal avatar Mar 25 '19 04:03 asdspal

While Gower distance is quite useful, it is also somewhat heuristic. I would recommend exploring it as one of the options for handling mixed continuous and categorical data.

lmcinnes avatar Mar 25 '19 15:03 lmcinnes

Is it possible to create indicator variables from categorical variables?

acilingi avatar Dec 24 '21 19:12 acilingi

One approach is pd.get_dummies, but you may also want to look at the dirty-cat library for richer options.

lmcinnes avatar Dec 29 '21 17:12 lmcinnes
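
As a rough sketch of the indicator-variable approach, the snippet below one-hot encodes categorical columns with pd.get_dummies and standardizes the numeric ones so that neither group dominates the distances. The helper name `encode_mixed`, the column names, and the toy rows are assumptions for illustration; how to scale numeric columns relative to the indicators is a judgment call, not something the thread prescribes.

```python
import pandas as pd


def encode_mixed(df, numeric_cols, categorical_cols):
    """Standardize numeric columns and one-hot encode categorical ones.

    Hypothetical helper; assumes no missing values.
    """
    numeric = df[numeric_cols].astype(float)
    std = numeric.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero
    numeric = (numeric - numeric.mean()) / std

    # One indicator column per category, named "<column>_<category>".
    dummies = pd.get_dummies(
        df[categorical_cols], columns=categorical_cols, dtype=float
    )
    return pd.concat([numeric, dummies], axis=1)


# Toy Titanic-like rows (made up for the example).
df = pd.DataFrame(
    {
        "Age": [22, 38, 26],
        "Fare": [7.25, 71.28, 7.92],
        "Sex": ["male", "female", "female"],
        "Embarked": ["S", "C", "S"],
    }
)
X = encode_mixed(df, ["Age", "Fare"], ["Sex", "Embarked"])

# With umap-learn installed, the encoded matrix can be embedded directly:
# embedding = umap.UMAP().fit_transform(X.to_numpy())
```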