pytorch_tabular
missing values and unknown categories in SSL models
Hello,
Handling of missing values and/or unknown categories is not allowed for SSL models. Could you help me understand why this is the case? With real-world data this causes hard-to-understand error messages, which then require out-of-band preprocessing of the data to resolve.
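For context, the out-of-band preprocessing mentioned above could look roughly like the sketch below. It is only one possible strategy (median imputation for numeric columns, most frequent training category for missing or unseen categorical values). The function name `preprocess_out_of_band` and its parameters are hypothetical and are not part of pytorch_tabular.

```python
import pandas as pd


def preprocess_out_of_band(train, test, num_cols, cat_cols):
    """Hypothetical helper: remove NaNs and unseen categories before training.

    Assumption: statistics are fitted on the training frame only and then
    applied to the test frame, so no information leaks from test to train.
    """
    train, test = train.copy(), test.copy()

    for col in num_cols:
        # Impute numeric gaps with the training median.
        median = train[col].median()
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)

    for col in cat_cols:
        # Most frequent training category, ignoring NaN.
        most_frequent = train[col].mode()[0]
        train[col] = train[col].astype("object").fillna(most_frequent)
        test[col] = test[col].astype("object").fillna(most_frequent)
        # Replace categories never seen in training with the same fallback.
        known = set(train[col].unique())
        test[col] = test[col].where(test[col].isin(known), most_frequent)

    return train, test
```

Having to maintain something like this outside the library is the extra step I would like to avoid.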
Could we allow for these options to be available in SSL models? Is there a fundamental reason that I am not understanding for this restriction?
reference to the section of the code in question: https://github.com/manujosephv/pytorch_tabular/blob/728578765b705cef5867f49289cf1cf203f1898f/src/pytorch_tabular/tabular_model.py#L234-L240
One last bit of color: in a fork I removed this validation block and was able to train an SSL model using both missing-value and unknown-category handling.
In the SSL model (right now that is the Denoising Autoencoder), we train the model to predict the input data back. With this learning objective, predicting missing values as a separate token didn't make sense to me. This is why that option was disabled: to force the user to treat the missing values the right way.
Unlike a prediction task, where it's beneficial to learn what to do when a new category value shows up, does it make sense in SSL?
If missing values are expected to be present in the data at prediction time, allowing for them in the training data makes sense to me. As a practical matter, I would prefer the user be allowed to decide for themselves if they want this behaviour or not. Perhaps a warning in the logs or in the documentation instead of explicitly disabling the ability to choose might be a better option?
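As a rough illustration of the warning-instead-of-error idea, here is a minimal sketch. The function name `check_ssl_data_options` is hypothetical, and I am assuming the two options are plain boolean flags. This is not the library's actual validation code.

```python
import warnings


def check_ssl_data_options(handle_missing_values: bool,
                           handle_unknown_categories: bool) -> None:
    """Hypothetical check: warn, rather than raise, for SSL models."""
    enabled = [
        name
        for name, flag in (
            ("handle_missing_values", handle_missing_values),
            ("handle_unknown_categories", handle_unknown_categories),
        )
        if flag
    ]
    if enabled:
        # The user keeps the final say; the log just records the caveat.
        warnings.warn(
            "SSL model trained with " + ", ".join(enabled) + " enabled. "
            "The model will learn to reconstruct the missing/unknown tokens "
            "as if they were ordinary values.",
            UserWarning,
            stacklevel=2,
        )
```

With both flags off nothing is emitted. With either flag on, training proceeds and the caveat shows up in the logs.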
Hmmm... Yeah, I agree. But we will also have to thoroughly test the inclusion for corner cases.
Would you be willing to raise a PR for it?
Sure thing:
https://github.com/manujosephv/pytorch_tabular/pull/470
I've been running this for a good month or so without issue but I'm happy to add / update documentation or test cases if you can describe to me what needs to be done.