[Error] What if I want to apply embedding_column to categorical variables in Parametric UMAP?
When reducing the dimensionality of tabular data, I want to use an embedding layer for the categorical variables. However, I don't think this is possible with the input format currently supported (arrays only). How can I solve this?
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv("./data/adult.csv")
categorical_columns = list(df.select_dtypes("object"))
continuous_columns = [i for i in list(df) if i not in categorical_columns]
feature_columns = []
inputs = {}
for col in continuous_columns:
    feature_columns.append(tf.feature_column.numeric_column(col))
    inputs[col] = tf.keras.Input(shape=(1,), name=col, dtype=tf.dtypes.float32)
for col in categorical_columns:
    le = LabelEncoder()
    le.fit(df[col])
    df[col] = le.transform(df[col])
    #df[col] = df[col].astype(np.float32)
    unique_v = df[col].unique().tolist()
    _col = tf.feature_column.categorical_column_with_vocabulary_list(col, unique_v)
    _emb = tf.feature_column.embedding_column(_col, 4)
    feature_columns.append(_emb)
    inputs[col] = tf.keras.Input(shape=(1,), name=col, dtype=tf.dtypes.int32)
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
n_components = 2
x = feature_layer(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dropout(.1)(x)
out = tf.keras.layers.Dense(n_components)(x)
encoder = tf.keras.Model(inputs=dict(inputs), outputs=out)
encoder.summary()
from umap.parametric_umap import ParametricUMAP
embedder = ParametricUMAP(encoder=encoder, dims=(15,))
embedding = embedder.fit_transform(dict(df))
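For comparison, the array input mentioned in the question can be produced by expanding the categorical columns into dense numeric columns before calling UMAP. The following is only a minimal sketch: it assumes one-hot encoding as a stand-in for the learned embedding_column, and the helper `one_hot_dense` and the toy rows are hypothetical, not part of umap or of the original code.

```python
import numpy as np

def one_hot_dense(columns):
    """Concatenate numeric columns and one-hot encoded categorical columns.

    `columns` maps a column name to a 1-D sequence of values. String-valued
    columns are treated as categorical; everything else as numeric.
    (Hypothetical helper, for illustration only.)
    """
    blocks = []
    for name, values in columns.items():
        values = np.asarray(values)
        if values.dtype.kind in ("U", "S", "O"):
            # np.unique returns the sorted categories and each row's integer code
            categories, codes = np.unique(values, return_inverse=True)
            blocks.append(np.eye(len(categories), dtype=np.float32)[codes])
        else:
            blocks.append(values.astype(np.float32).reshape(-1, 1))
    return np.hstack(blocks)

# Hypothetical toy rows standing in for a few columns of adult.csv
toy = {
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp"],
    "hours_per_week": [40, 13, 40, 40],
}
X = one_hot_dense(toy)
print(X.shape)  # (4, 5): 1 numeric + 3 one-hot + 1 numeric
```

An array like `X` could then be passed to `fit_transform` in place of `dict(df)`. The trade-off is that the categorical representation is fixed rather than learned, and the array grows with the number of distinct categories.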
Is the dataset too large to input the dense data (the output of feature_layer) rather than making it part of the dataframe?
If so, it seems like you would need to build the graph and do the embedding separately: graph building would use the same method as non-parametric UMAP, and the next step would then use that graph and your sparse dataset as input to UMAP.
There's a notebook here on extending the model and breaking out these two steps: https://colab.research.google.com/drive/1WkXVZ5pnMrm17m0YgmtoNjM_XHdnE5Vp?usp=sharing
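To illustrate what "building the graph separately" might look like, here is a rough sketch of the first step only. It is an assumption-laden stand-in: it builds a plain symmetric k-nearest-neighbor graph with NumPy, not the fuzzy simplicial set that UMAP itself constructs, and the function `knn_edges` and the toy data are hypothetical. The second step (training the encoder on pairs sampled from these edges, with the original categorical inputs) is not shown; the linked notebook covers the actual approach.

```python
import numpy as np

def knn_edges(X, n_neighbors):
    """Return (head, tail) index arrays of a symmetric k-nearest-neighbor graph.

    Simplified stand-in for UMAP's graph construction (hypothetical helper).
    """
    X = np.asarray(X, dtype=np.float64)
    # Pairwise squared Euclidean distances; O(n^2) memory, so small data only
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]
    adj = np.zeros(d2.shape, dtype=bool)
    rows = np.repeat(np.arange(len(X)), n_neighbors)
    adj[rows, nbrs.ravel()] = True
    adj |= adj.T  # keep an edge if either endpoint selected the other
    head, tail = np.nonzero(adj)
    return head, tail

# Hypothetical data: two well-separated clusters of three points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
head, tail = knn_edges(X, n_neighbors=2)
```

The graph is computed once from a numeric view of the data, and the resulting `(head, tail)` pairs only carry row indices, so a later training step is free to look up whatever per-row inputs the encoder expects, including a dict of categorical columns.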
Thank you. I'll try again using the link you sent me!