benchmark_VAE
VAE for tabular data for dimension reduction
Hi Clément,
Thanks for creating and maintaining this great repo. I'm a biostatistician working on environmental epidemiology (meaning that I'm new to machine learning and my questions may be naive), and I'm trying to tackle the high correlation issue with VAE or disentanglement learning.
My question is quite different (in my view) from the questions in the example code: the data in my field are tabular datasets with observations in the rows and variables in the columns (2D), while the example data and code in the repo are mostly images (3D). I'm wondering how I could set up the correct dataset format and input dimension for benchmark_VAE to work? Please see a small example dataset below.
dl_train
| SO4 NO3 NH4 OM BC
--- + ---------- ---------- ---------- --------- ---------
0 | 9.75255 12.2174 7.41296 14.8118 2.77726
1 | 7.41267 9.18699 6.56743 10.5916 1.89571
2 | 7.67942 9.50747 6.8047 10.9361 1.95844
3 | 7.16214 8.52167 6.206 10.1743 1.84438
4 | 9.68588 12.1869 7.39739 14.6614 2.74922
5 | 9.65254 12.7658 8.79049 13.3749 2.51152
6 | 10.7724 13.1742 9.19811 13.9698 2.59258
7 | 9.10471 12.2183 8.55336 13.3383 2.52734
8 | 9.18762 12.3056 8.6236 13.4588 2.55033
9 | 13.2112 14.4727 10.4412 15.7128 2.99749
10 | 12.5401 15.4839 10.0014 16.7971 3.1917
11 | 7.55747 8.95752 6.55564 10.9204 1.99788
12 | 7.7435 9.82413 6.95538 11.0182 1.96674
13 | 7.32966 9.23089 6.57861 11.0256 2.03959
14 | 9.74056 12.1903 7.39755 14.7946 2.77484
… | … … … … …
995 | 9.21827 11.2173 7.82692 10.9786 2.02918
996 | 11.0007 14.519 9.44757 16.752 3.17391
997 | 7.50137 9.51056 6.79015 11.4156 2.14255
998 | 8.20999 10.6865 7.28414 11.2109 2.12318
999 | 14.7959 15.3163 10.5064 20.9706 4.47214
My aim is to reduce the column dimension of this dataset because the variables (SO4, NO3, NH4, OM, BC) are highly correlated, and putting them all in one model will cause variance inflation. I wonder how I could set up the right benchmark_VAE code to achieve this aim. Currently my code looks like this:
from pythae.pipelines import TrainingPipeline
from pythae.models import VAE, VAEConfig
from pythae.trainers import BaseTrainerConfig
import numpy as np
my_training_config = BaseTrainerConfig(
output_dir='./',
num_epochs=5,
learning_rate=1e-3,
per_device_train_batch_size=200,
per_device_eval_batch_size=200,
train_dataloader_num_workers=2,
eval_dataloader_num_workers=2,
steps_saving=20,
optimizer_cls="AdamW",
optimizer_params={"weight_decay": 0.05, "betas": (0.91, 0.995)},
scheduler_cls="ReduceLROnPlateau",
scheduler_params={"patience": 5, "factor": 0.5}
)
# Set up the model configuration
my_vae_config = VAEConfig(
input_dim=(1000, 6),
latent_dim=10
)
# Build the model
my_vae_model = VAE(model_config=my_vae_config)
# Build the Pipeline
pipeline = TrainingPipeline(
training_config=my_training_config,
model=my_vae_model
)
dl_train_sample = dl_dt[0:1000,:].to_numpy()
dl_eval_sample = dl_dt[1001:2001,:].to_numpy()
# Launch the Pipeline
pipeline(
train_data=dl_train_sample, # must be torch.Tensor, np.array or torch datasets
eval_data=dl_eval_sample # must be torch.Tensor, np.array or torch datasets
)
But it reported the following error. I guess I did not set up the input datasets and input dimensions correctly. Any ideas would be appreciated.
Preprocessing train data...
INFO:pythae.pipelines.training:Preprocessing train data...
Checking train dataset...
INFO:pythae.pipelines.training:Checking train dataset...
Preprocessing eval data...
INFO:pythae.pipelines.training:Preprocessing eval data...
Checking eval dataset...
INFO:pythae.pipelines.training:Checking eval dataset...
Using Base Trainer
INFO:pythae.pipelines.training:Using Base Trainer
ModelError: Error when calling forward method from model. Potential issues:
- Wrong model architecture -> check encoder, decoder and metric architecture if you provide yours
- The data input dimension provided is wrong -> when no encoder, decoder or metric provided, a network is built automatically but requires the shape of the flatten input data.
Exception raised: <class 'RuntimeError'> with message: shape '[-1, 6000]' is invalid for input of size 1000
Thanks, Miao
Hello @caimiao0714,
Thank you for the kind words and your interest in the repo. :) From what I understand, your data is such that you have 2001 different data points each with 5 values (SO4 , NO3, NH4, OM, BC). Did I understand correctly?
In such a case, you should only specify the dimension of a single data point (i.e., 5 in your case) in the input_dim argument of the VAEConfig instance. See below a working example adapted from your case but with random values:
from pythae.pipelines import TrainingPipeline
from pythae.models import VAE, VAEConfig
from pythae.trainers import BaseTrainerConfig
import numpy as np
import torch
# dummy datasets
dl_dt = torch.randn(2001, 5)
my_training_config = BaseTrainerConfig(
output_dir='./',
num_epochs=5,
learning_rate=1e-3,
per_device_train_batch_size=200,
per_device_eval_batch_size=200,
train_dataloader_num_workers=2,
eval_dataloader_num_workers=2,
steps_saving=20,
optimizer_cls="AdamW",
optimizer_params={"weight_decay": 0.05, "betas": (0.91, 0.995)},
scheduler_cls="ReduceLROnPlateau",
scheduler_params={"patience": 5, "factor": 0.5}
)
# Set up the model configuration
my_vae_config = VAEConfig(
input_dim=(5,), ####### This is what changed from your code #######
latent_dim=10
)
# Build the model
my_vae_model = VAE(model_config=my_vae_config)
# Build the Pipeline
pipeline = TrainingPipeline(
training_config=my_training_config,
model=my_vae_model
)
dl_train_sample = dl_dt[0:1000,:].numpy()
dl_eval_sample = dl_dt[1001:2001,:].numpy()
# Launch the Pipeline
pipeline(
train_data=dl_train_sample, # must be torch.Tensor, np.array or torch datasets
eval_data=dl_eval_sample # must be torch.Tensor, np.array or torch datasets
)
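To see where the numbers in the error message come from, here is a minimal NumPy sketch. It assumes (as the error message implies, but this is not taken from the pythae source) that the automatically built network flattens each mini-batch to `prod(input_dim)` columns; `flatten_batch` is a hypothetical helper written only for this illustration.

```python
import numpy as np

batch_size = 200   # per_device_train_batch_size in the training config
n_features = 5     # SO4, NO3, NH4, OM, BC

# One mini-batch of tabular rows: 200 * 5 = 1000 values in total
batch = np.zeros((batch_size, n_features))

def flatten_batch(batch, input_dim):
    """Hypothetical stand-in: reshape a batch to (-1, prod(input_dim))."""
    return batch.reshape(-1, int(np.prod(input_dim)))

# input_dim=(1000, 6) asks for 6000 values per data point,
# which a batch holding only 1000 values cannot supply.
try:
    flatten_batch(batch, (1000, 6))
    wrong_shape_ok = True
except ValueError:
    wrong_shape_ok = False

# input_dim=(5,) describes a single row, so the batch keeps its 200 rows.
right_shape = flatten_batch(batch, (5,)).shape
print(wrong_shape_ok, right_shape)
```

In other words, input_dim describes one observation, never the whole dataset, so the number of rows should not appear in it.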
PS: Do not hesitate to adapt the neural networks you use for the encoder and decoder to make them better suited for tabular data as well.
I hope this helps!
Best,
Clément
Hi Clément,
Thank you! This helps a lot. One more question concerns data generation after fitting the model. I notice that the example in the official manual generates new data as pictures (.png). I wonder if you could give an example in which the data are generated as tabular data? Specifically, I would be interested in generating the disentangled tabular data for dl_train_sample and dl_eval_sample row by row.
Thanks, Miao
Hi @caimiao0714, I am glad to see that my previous comment helped :)
As to the generation of synthetic data, it is indeed performed after training the model. For instance, assuming that you have trained the model as explained in the previous comment, you can generate new synthetic tabular data as follows:
from pythae.models import AutoModel
from pythae.samplers import NormalSampler
# Reload the trained model from the folder where it was stored
trained_model = AutoModel.load_from_folder('VAE_training_2023-03-23_18-25-25/final_model').eval()
# Create the sampler
sampler = NormalSampler(trained_model)
# Launch the sample function
gen_samples = sampler.sample(
num_samples=100, # specify the number of samples you want to generate
return_gen=True # specify that you want the sampler to return the generated samples
)
print(gen_samples.shape)
As to generating disentangled data, did you mean this in the sense of #78 ?
I hope this helps :)
Best,
Clément
Hi Clément,
Thanks for your help in generating samples. This is very useful!
For generating disentangled data, I'm not sure I fully understand issue #78. Let me try to illustrate my point in a simpler way.
Problem setting. For the dummy dataset generated by dl_dt = torch.randn(2001, 5), let's assume that it is a tensor with 5 features ($x_1, x_2, \ldots, x_5$), and that I am actually trying to construct a supervised machine learning model for a dependent variable $y$ (dl_y = torch.randn(2001, 1)). Let's assume that the supervised machine learning model is a simple linear model.
Why I chose disentanglement learning. The reason I'm trying to apply disentanglement learning to the dataset dl_dt is that the features $x_1, x_2, \ldots, x_5$ are highly correlated, and putting them all in the linear regression will cause multicollinearity. Therefore, I'm trying to use disentanglement learning models to disentangle $x_1, x_2, \ldots, x_5$ into relatively independent features $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_5$ (the number of disentangled features could actually be anything). After that, I could use the disentangled features $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_5$ to predict $y$ (dl_y) and would no longer have the multicollinearity issue.
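As a rough illustration of the multicollinearity concern, the sketch below computes variance inflation factors (VIFs) with NumPy, using the identity that $\mathrm{VIF}_j = 1/(1 - R_j^2)$ equals the $j$-th diagonal entry of the inverse correlation matrix. The data are synthetic stand-ins, not the pollutant measurements from this thread, and `variance_inflation_factors` is a name introduced here for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of X.

    These are the diagonal entries of the inverse correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
n = 2000

# Five features that all follow one shared signal plus a little noise,
# mimicking strongly correlated measurements (synthetic, for illustration).
shared = rng.normal(size=(n, 1))
X_correlated = shared + 0.1 * rng.normal(size=(n, 5))

# Five unrelated features for comparison.
X_independent = rng.normal(size=(n, 5))

vif_correlated = variance_inflation_factors(X_correlated)
vif_independent = variance_inflation_factors(X_independent)
print(vif_correlated.round(1))
print(vif_independent.round(3))
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a warning sign; the correlated block lands far above that, while the independent block stays close to 1.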
Problem with the current code. At this stage, hopefully, you can see the problem with gen_samples in your last response. These generated data (gen_samples) are not related to the original data dl_dt by row, so they cannot be used to predict $y$ (dl_y) in the supervised machine learning models afterward.
I hope that my question and problem are clear.
Thanks, Miao
Hi @caimiao0714,
Sorry for the late reply. From what I understand (tell me if I am wrong), you would like to use a different representation of the input data that can be used as input for your supervised model. If so, you can definitely do this using the models available in the library. You can, for instance, use the latent representations of dl_dt as inputs to your model. To retrieve the latent representation of your input, you can use the embed method as follows:
import torch
from pythae.models import AutoModel
# Reload the trained model
trained_model = AutoModel.load_from_folder('path/to/model').eval()
# Get the embeddings (one row per row of dl_train_sample)
embeddings = trained_model.embed(torch.from_numpy(dl_train_sample))
In such a case, each row of embeddings then corresponds to the latent-space representation of the matching row of dl_train_sample.
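Because the embeddings stay aligned with the input rows, they can be fed to a linear model directly. The following is a minimal sketch under stated assumptions: `embeddings` is a random NumPy stand-in for the output of embed, and `y` is a hypothetical outcome simulated for illustration, so none of the numbers reflect real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, latent_dim = 1000, 10

# Stand-in for the embed output; with a real torch tensor one would first
# convert it, e.g. embeddings = embeddings.detach().cpu().numpy()
embeddings = rng.normal(size=(n_rows, latent_dim))

# Hypothetical outcome with one value per row (simulated for illustration)
true_coef = rng.normal(size=latent_dim)
y = 2.0 + embeddings @ true_coef + 0.1 * rng.normal(size=n_rows)

# Ordinary least squares with an intercept column
design = np.column_stack([np.ones(n_rows), embeddings])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Goodness of fit on the training rows
y_hat = design @ coef
r_squared = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(coef.shape, round(float(r_squared), 3))
```

The key point is only the row alignment: row i of the design matrix and element i of y must refer to the same observation, which holds for embeddings but not for freshly sampled data.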
I hope this helps.
Best,
Clément
Hi Clément,
Thanks a lot for the comment. Yes, this works. One additional question I have is how to gain insight into the relationship between the original data and the embeddings in the latent space. I tried to use Pearson correlation coefficients to understand the relationship between the two, but I found little correlation; see the figure below. BC, NH4, ..., and SO4 on the x-axis are the original data, and V0 to V4 on the y-axis are the latent embeddings.
Miao
Hi @caimiao0714,
I am happy to see that this is working. As to the relationship between the latent embeddings and the input data, I am not sure what you are expecting from this. The VAE model will embed the input data in the latent space using potentially highly non-linear functions, so I am not sure that you will be able to relate the latent embedding coordinates directly to those of the input data. Nonetheless, you can still try models that specifically target the task of learning disentangled representations, such as the $\beta$-VAE, FactorVAE, or $\beta$-TC-VAE. Maybe those models can be helpful as well.
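To illustrate why a Pearson correlation table can look empty even when the embedding is fully determined by the input, here is a small NumPy sketch. The helper `cross_correlation` and the toy "embedding" columns are assumptions made for this illustration; they are not produced by a trained VAE.

```python
import numpy as np

def cross_correlation(features, embeddings):
    """Pearson correlations between each feature column and each embedding column."""
    p = features.shape[1]
    full = np.corrcoef(np.hstack([features, embeddings]), rowvar=False)
    return full[:p, p:]   # rows: original features, columns: latent coordinates

# One toy feature on a symmetric grid
x = np.linspace(-1.0, 1.0, 1001)
features = x[:, None]

# Two toy "latent coordinates", both exact functions of x:
# the first is non-linear (x squared), the second is linear.
embeddings = np.column_stack([x ** 2, 2.0 * x + 1.0])

corr = cross_correlation(features, embeddings)
print(corr.round(3))
```

The quadratic coordinate has essentially zero Pearson correlation with x despite depending on it completely, whereas the linear coordinate correlates perfectly. Rank-based or non-linear dependence measures may therefore be more informative than Pearson coefficients for this kind of check.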
Best,
Clément