How to project numerical and categorical data?
I want to project the Titanic dataset (https://www.kaggle.com/c/titanic/data), which contains both categorical and numerical data.
In this video (https://www.youtube.com/watch?v=YPJQydzTLwQ, at 48:08) it is said that UMAP can combine multiple data types, so my question is: how?
One approach would be to project the numerical data alone, then the categorical data alone, and finally combine the two in the same space. Is that the way to go?
Thank you for your time.
This can be done in theory; in practice I am still working on the code to do this, so it isn't available in the repository yet. This may not be the answer you are looking for. As an interim step you can check issue #58, which provides a simple recipe for doing this in straightforward cases.
Data with both categorical and numerical types can be handled using the Gower distance metric. You can download the code for the Gower distance metric from here. It might be available in a coming scikit-learn release.
While Gower distance is quite useful, it is also somewhat heuristic. I would recommend exploring it as one of the options for handling mixed continuous and categorical data.
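To make the Gower suggestion concrete, here is a minimal NumPy sketch of the usual definition: numeric features contribute their absolute difference scaled by the column range, categorical features contribute 0 on a match and 1 otherwise, and the per-feature values are averaged. The function name gower_distance_matrix and the toy rows are made up for illustration; this is not the code linked above, and it assumes no missing values (the real Titanic data has some, so they would need imputing or dropping first).

```python
import numpy as np

def gower_distance_matrix(numeric, categorical):
    """Pairwise Gower distances for n samples (hypothetical helper).

    numeric: float array of shape (n, p); categorical: array of shape (n, q).
    Assumes there are no missing values in either array.
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    n_features = numeric.shape[1] + categorical.shape[1]

    # Numeric part: absolute difference scaled by each column's range.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute zero distance
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges

    # Categorical part: 0 where the categories match, 1 where they differ.
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).astype(float)

    # Average the per-feature dissimilarities over all features.
    return (num_part.sum(axis=2) + cat_part.sum(axis=2)) / n_features

# Toy rows in the spirit of the Titanic data: (age, fare) and (sex, embarked).
numeric = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92]])
categorical = np.array([["male", "S"], ["female", "C"], ["female", "S"]])
D = gower_distance_matrix(numeric, categorical)

# With umap-learn installed, D could then be embedded as precomputed distances:
#   import umap
#   embedding = umap.UMAP(metric="precomputed").fit_transform(D)
```

The full n-by-n matrix takes quadratic memory, so this only suits datasets of modest size (the Titanic training set, at under a thousand rows, is fine).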
Is it possible to create indicator variables from the categorical variables?
One approach is pd.get_dummies, but you may also want to look at the dirty-cat library for richer options.
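A rough sketch of the pd.get_dummies route, using a few made-up rows with Titanic-style column names. Standardizing the numeric columns is my own assumption, not something suggested in this thread; without some rescaling, a wide-ranging column like Fare would tend to dominate Euclidean distances over the 0/1 indicators.

```python
import pandas as pd

# Made-up rows; column names follow the Kaggle Titanic data.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Replace each categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=float)

# Assumption: z-score the numeric columns so they are on a scale
# comparable to the indicator columns.
num_cols = ["Age", "Fare"]
encoded[num_cols] = (encoded[num_cols] - encoded[num_cols].mean()) / encoded[num_cols].std()

# With umap-learn installed, the all-numeric table can be embedded directly:
#   import umap
#   embedding = umap.UMAP().fit_transform(encoded.to_numpy())
```

How much weight the indicators should carry relative to the numeric columns is a judgment call, so it is worth trying a few scalings and comparing the resulting embeddings.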