missing values and unknown categories in SSL models

Open sorenmacbeth opened this issue 1 year ago • 6 comments

Hello,

Handling of missing values and/or unknown categories is not allowed for SSL models. Could you help me understand why this is the case? With real-world data this causes hard-to-understand error messages, which then require out-of-band preprocessing of the data to resolve.

Could we allow for these options to be available in SSL models? Is there a fundamental reason that I am not understanding for this restriction?
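For readers following along, the out-of-band preprocessing mentioned above might look roughly like the sketch below. It is an illustration only, not code from the thread or from pytorch_tabular: the function name `preprocess_for_ssl`, the `__MISSING__` token, and the choices of median imputation and most-frequent-category fallback are all assumptions made for the example.

```python
import pandas as pd


def preprocess_for_ssl(train, test, categorical_cols, continuous_cols,
                       missing_token="__MISSING__"):
    """Hypothetical helper: remove NaNs and unseen categories before SSL training.

    Statistics are computed on the training frame only and then reused on the
    test frame, so the test frame never contains a value the model has not seen.
    """
    train, test = train.copy(), test.copy()

    # Continuous columns: impute with the training median (an assumed strategy).
    for col in continuous_cols:
        median = train[col].median()
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)

    # Categorical columns: give missing values an explicit token, then map any
    # test value that never occurred in training to the most frequent category.
    for col in categorical_cols:
        train[col] = train[col].astype("object").fillna(missing_token)
        test[col] = test[col].astype("object").fillna(missing_token)
        known = set(train[col].unique())
        fallback = train[col].mode().iloc[0]
        test[col] = test[col].where(test[col].isin(known), fallback)

    return train, test
```

Doing this by hand for every dataset is the overhead the issue is asking to avoid.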

sorenmacbeth avatar Jul 16 '24 18:07 sorenmacbeth

Reference to the section of the code in question: https://github.com/manujosephv/pytorch_tabular/blob/728578765b705cef5867f49289cf1cf203f1898f/src/pytorch_tabular/tabular_model.py#L234-L240

sorenmacbeth avatar Jul 18 '24 06:07 sorenmacbeth

Last bit of color: in a fork I removed this validation block and was able to train an SSL model using both missing-value and unknown-category handling.

sorenmacbeth avatar Jul 18 '24 21:07 sorenmacbeth

In the SSL model (right now it's the Denoising Autoencoder), we train the model to reconstruct the input data. With this learning objective, predicting missing values as a separate token didn't make sense to me. That is why the option was disabled: to force the user to treat missing values the right way.

In a prediction task it's beneficial to learn what to do when a new category value shows up, but does that make sense in SSL?

manujosephv avatar Aug 19 '24 05:08 manujosephv

> In the SSL model (right now it's the Denoising Autoencoder), we train the model to reconstruct the input data. With this learning objective, predicting missing values as a separate token didn't make sense to me. That is why the option was disabled: to force the user to treat missing values the right way.
>
> In a prediction task it's beneficial to learn what to do when a new category value shows up, but does that make sense in SSL?

If missing values are expected to be present in the data at prediction time, allowing for them in the training data makes sense to me. As a practical matter, I would prefer that users be allowed to decide for themselves whether they want this behaviour. Perhaps a warning in the logs or in the documentation, instead of explicitly disabling the ability to choose, would be a better option?
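As a rough illustration of the "warn instead of disable" idea, here is a minimal standalone sketch. It is not the library's actual validation code or the change in any PR: the function `check_ssl_data_options` is hypothetical, and the parameter names `handle_missing_values` and `handle_unknown_categories` are assumed to mirror the options discussed in this thread.

```python
import warnings


def check_ssl_data_options(task, handle_missing_values, handle_unknown_categories):
    """Hypothetical check: warn (rather than fail) when an SSL run enables
    missing-value or unknown-category handling."""
    if task != "ssl":
        return  # Nothing to flag for supervised tasks.

    # Collect the names of the options that are switched on.
    enabled = [
        name
        for name, flag in (
            ("handle_missing_values", handle_missing_values),
            ("handle_unknown_categories", handle_unknown_categories),
        )
        if flag
    ]
    if enabled:
        warnings.warn(
            f"SSL task with {', '.join(enabled)} enabled: the model will learn "
            "to reconstruct the placeholder tokens as well. Make sure this is "
            "what you want.",
            UserWarning,
            stacklevel=2,
        )
```

With a check like this, the user keeps the choice and still gets a visible signal that the reconstruction objective now includes the placeholder tokens.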

sorenmacbeth avatar Aug 19 '24 09:08 sorenmacbeth

Hmmm... Yeah, I agree. But we will also have to thoroughly test the inclusion for corner cases.

Would you be willing to raise a PR for it?

manujosephv avatar Aug 19 '24 13:08 manujosephv

Sure thing:

https://github.com/manujosephv/pytorch_tabular/pull/470

I've been running this for a good month or so without issue but I'm happy to add / update documentation or test cases if you can describe to me what needs to be done.

sorenmacbeth avatar Aug 19 '24 17:08 sorenmacbeth