[Error] What if I want to apply embedding_column to category variables in Parametric UMAP?

Open sungreong opened this issue 3 years ago • 2 comments

I am reducing the dimensionality of tabular data and want the categorical variables to go through an embedding layer (embedding_column) inside the encoder. However, this does not seem possible with the input format currently supported, which only accepts arrays. How can I solve this?

data links

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder


# Load the data and split the columns by dtype.
df = pd.read_csv("./data/adult.csv")
categorical_columns = list(df.select_dtypes("object"))
continuous_columns = [i for i in list(df) if i not in categorical_columns]
feature_columns = []
inputs = {}

# Continuous columns: one numeric feature column and one float input each.
for col in continuous_columns:
    feature_columns.append(tf.feature_column.numeric_column(col))
    inputs[col] = tf.keras.Input(shape=(1,), name=col, dtype=tf.dtypes.float32)

# Categorical columns: label-encode, then attach a 4-dimensional embedding.
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    # df[col] = df[col].astype(np.float32)
    unique_v = df[col].unique().tolist()
    _col = tf.feature_column.categorical_column_with_vocabulary_list(col, unique_v)
    _emb = tf.feature_column.embedding_column(_col, 4)
    feature_columns.append(_emb)
    inputs[col] = tf.keras.Input(shape=(1,), name=col, dtype=tf.dtypes.int32)

# DenseFeatures maps the dict of inputs to a single dense tensor.
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

n_components = 2
x = feature_layer(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dropout(.1)(x)
out = tf.keras.layers.Dense(n_components)(x)
encoder = tf.keras.Model(inputs=dict(inputs), outputs=out)
encoder.summary()

from umap.parametric_umap import ParametricUMAP
embedder = ParametricUMAP(encoder=encoder, dims=(15,))
embedding = embedder.fit_transform(dict(df))

sungreong • Jan 08 '22 05:01

Is the dataset too large to input the dense data (feature_layer) rather than making it part of the dataframe?
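One reading of the question above is that the mixed columns could be converted into a single dense array up front, so that the array-only input is no longer a limitation. The sketch below is only an illustration of that idea under stated assumptions: `to_dense_array` and the toy data are hypothetical, and one-hot encoding is used as a simple stand-in for the learned `embedding_column` in the original post (it is not equivalent to it).

```python
import numpy as np

def to_dense_array(columns, categorical):
    """Turn a dict of equal-length columns into one float32 matrix.

    `columns` maps column name -> 1-D sequence. Names listed in
    `categorical` are one-hot encoded; all others are cast to float.
    (Hypothetical helper, not part of umap or TensorFlow.)
    """
    blocks = []
    for name, values in columns.items():
        values = np.asarray(values)
        if name in categorical:
            # np.unique returns the sorted categories and, for each row,
            # the index of that row's category.
            categories, codes = np.unique(values, return_inverse=True)
            blocks.append(np.eye(len(categories), dtype=np.float32)[codes])
        else:
            blocks.append(values.astype(np.float32).reshape(-1, 1))
    return np.hstack(blocks)

# Toy stand-in for the Adult data (made-up values).
toy = {
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp"],
}
X = to_dense_array(toy, categorical={"workclass"})
print(X.shape)  # one numeric column plus three one-hot columns
```

An array like `X` has the plain array form that the original post says is supported. The obvious cost is that one-hot columns grow with the number of categories, which is presumably why dataset size matters here.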

If so, it seems like you would need to build the graph and do the embedding separately. Graph building would use the same method as non-parametric UMAP; the next step would then use that graph and your sparse dataset as input to UMAP.
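To make the two-step split concrete, here is a rough sketch, not the notebook's code. It assumes a small dense array is available for graph building, and it uses a plain unweighted k-NN graph as a stand-in for the weighted fuzzy graph that UMAP actually constructs. `knn_edges` and the toy data are hypothetical. The point is that the graph only yields pairs of row indices, so the second step can look up the original (dict-style) columns for each edge.

```python
import numpy as np

def knn_edges(X, n_neighbors):
    """Return (head, tail) row indices of a symmetrized k-NN graph.

    Brute-force distances, so only suitable for small examples.
    (Hypothetical helper; UMAP's own graph is weighted.)
    """
    X = np.asarray(X, dtype=np.float64)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]
    head = np.repeat(np.arange(len(X)), n_neighbors)
    tail = nbrs.ravel()
    # Symmetrize: store each undirected edge once as (smaller, larger).
    pairs = np.unique(np.sort(np.stack([head, tail], axis=1), axis=1), axis=0)
    return pairs[:, 0], pairs[:, 1]

# Step 1: build the graph from some dense representation (toy values).
dense = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
head, tail = knn_edges(dense, n_neighbors=1)

# Step 2: gather the raw columns for both ends of every edge. Batches
# like these are what an encoder with dict inputs would be fed.
raw = {
    "age": np.array([39, 50, 38, 53]),
    "workclass": np.array([2, 0, 0, 1]),  # label-encoded categories
}
batch_from = {name: col[head] for name, col in raw.items()}
batch_to = {name: col[tail] for name, col in raw.items()}
print(head, tail)
```

Training the encoder on these edge batches with the UMAP loss is the part that the notebook linked in this thread covers; it is not reproduced here.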

There's a notebook here on extending the model and breaking out these two steps: https://colab.research.google.com/drive/1WkXVZ5pnMrm17m0YgmtoNjM_XHdnE5Vp?usp=sharing

timsainb • Jan 09 '22 22:01

Thank you. I'll try again using the link you sent me!

sungreong • Jan 15 '22 04:01