tab-ddpm

Trouble training/sampling on data with high-cardinality categorical features

Open reed-peterson-947 opened this issue 2 years ago • 2 comments

I've had success training and generating data with this package on a variety of datasets, but I have noticed that when a dataset contains a very high-cardinality feature, the package fails with a very uninformative error message: "Killed" and nothing else. As soon as I remove the high-cardinality feature, it runs fine. By high-cardinality I mean on the order of tens of thousands of unique values in a single column. I'm not sure how to debug this or where to start, given how uninformative the error message is. The last line of code that seems to execute before the process gets killed is line 579 in lib/data.py. Any ideas? Has anyone else run into this issue?

reed-peterson-947 avatar Aug 22 '23 18:08 reed-peterson-947

Hello,

I am not sure, but you may be running out of RAM because of the OneHotEncoder applied to the high-cardinality features.
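A bare "Killed" is what a Linux shell typically prints when a process receives SIGKILL, which is commonly the kernel's out-of-memory killer at work. To check whether that is plausible here, a rough back-of-the-envelope estimate of a one-hot encoded matrix can help. The sketch below assumes the encoded matrix is stored densely at 4 bytes per value; the helper names and the example row count and cardinalities are hypothetical, not taken from the package.

```python
def dense_one_hot_bytes(n_rows, cardinalities, bytes_per_value=4):
    """Approximate size in bytes of a dense one-hot matrix.

    Each categorical column contributes one output column per unique value,
    so the encoded width is the sum of the cardinalities.
    bytes_per_value=4 is an assumption (e.g. 32-bit floats).
    """
    return n_rows * sum(cardinalities) * bytes_per_value


def format_gib(n_bytes):
    """Render a byte count in GiB with one decimal place."""
    return f"{n_bytes / 2**30:.1f} GiB"


# Hypothetical dataset: 200k rows, one column with 50k unique values
# and two small categorical columns.
estimate = dense_one_hot_bytes(200_000, [50_000, 12, 5])
print(format_gib(estimate))
```

If the estimate is in the same range as, or larger than, the machine's available RAM, the one-hot step is a likely culprit.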

rotot0 avatar Oct 03 '23 15:10 rotot0

Moreover, even such a sophisticated model won't work magic on a categorical feature with so many modalities (distinct values), unless you have millions of rows, and even then I bet many modalities will be unrepresented in the synthetic data. So you could try to pre-process this column using domain knowledge?
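As one illustration of such pre-processing, here is a minimal sketch that collapses rare values into a single bucket before training. Note that this is a generic frequency threshold standing in for real domain knowledge (for example, mapping product IDs to product families); the function name, the `min_count` threshold and the `__OTHER__` label are all hypothetical choices.

```python
from collections import Counter


def collapse_rare_categories(values, min_count=2, other_label="__OTHER__"):
    """Replace values seen fewer than min_count times with other_label.

    This reduces the cardinality of the column, at the cost of losing
    the identity of the rare values.
    """
    counts = Counter(values)
    keep = {value for value, count in counts.items() if count >= min_count}
    return [value if value in keep else other_label for value in values]


# Toy column: "c" and "d" each appear once, so they are merged.
column = ["a", "a", "b", "c", "a", "b", "d"]
collapsed = collapse_rare_categories(column, min_count=2)
print(collapsed)
```

On a real dataset the threshold would need tuning, and the collapsed bucket cannot be mapped back to the original values in the synthetic output.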

paulduf avatar Oct 18 '23 15:10 paulduf