SDV Improve CTGAN performance by customizing categorical column transformations

Problem Description

CTGAN fails to finish in a reasonable time on datasets like Airbnb and Housing Market.

Expected behavior

We should review CTGAN speed on big datasets by figuring out what the current limitation is (number of columns/rows/categorical variables). We can then create a plan of action for improving the performance.

Sep 09 '21 04:09 katxiao

One of the main limiting factors of CTGAN is the number of categorical values. Anything in the hundreds will slow the model down many times over. In such cases, it may be better to have a hyperparameter that lets you treat categorical values as numerical, with a default of 200 (so a column with more than 200 categories would be treated as numerical).
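As a rough illustration of the threshold idea above, here is a minimal sketch in plain Python. It is not part of CTGAN or the SDV: the function names, the `max_categories` parameter, and the row format (a list of dicts) are all hypothetical, chosen only to show how columns could be partitioned by cardinality and how a high-cardinality column could be turned into integer codes.

```python
from collections import defaultdict


def split_by_cardinality(rows, candidate_columns, max_categories=200):
    """Partition candidate columns by their number of distinct values.

    Hypothetical helper: columns with more than `max_categories` distinct
    values are returned in the second list (to be treated as numerical).
    """
    distinct = defaultdict(set)
    for row in rows:
        for col in candidate_columns:
            distinct[col].add(row[col])

    categorical, numerical = [], []
    for col in candidate_columns:
        if len(distinct[col]) > max_categories:
            numerical.append(col)
        else:
            categorical.append(col)
    return categorical, numerical


def to_integer_codes(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(value, len(codes)) for value in values]


# Toy data: 500 distinct cities, 2 distinct room types.
rows = [
    {"city": f"city_{i}", "room_type": "private" if i % 2 else "shared"}
    for i in range(500)
]
categorical, numerical = split_by_cardinality(rows, ["city", "room_type"])
city_codes = to_integer_codes([row["city"] for row in rows])
```

With this toy data, `city` exceeds the threshold and would be handled as a numerical column, while `room_type` stays categorical.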

Sep 13 '21 14:09 fealho

+1, the categorical variable coverage is a known bottleneck of the CTGAN model.

In future updates of the SDV, we will make it easier for users to modify the RDT transformers that are used on this data. Once that is possible, you would be able to specify label or frequency encoding instead of the default. The risk in doing so is mode collapse: CTGAN may learn to synthesize only 1-2 categories. This may make for a good exploration when we're at that stage.
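To make the two alternatives concrete, here is a small plain-Python sketch of what label encoding and frequency encoding do to a categorical column. This is only an illustration under simplifying assumptions, not the RDT implementation: the function names are hypothetical, and the frequency encoding shown here simply maps each category to the midpoint of an interval in [0, 1] whose width equals the category's relative frequency.

```python
from collections import Counter


def label_encode(values):
    """Replace each category with an integer code (categories sorted first)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


def frequency_encode(values):
    """Replace each category with the midpoint of its frequency interval.

    Categories are laid out on [0, 1] from most to least frequent; each one
    gets an interval whose width is its share of the rows.
    """
    counts = Counter(values)
    total = len(values)
    mapping = {}
    start = 0.0
    for category, count in counts.most_common():
        width = count / total
        mapping[category] = start + width / 2
        start += width
    return [mapping[value] for value in values], mapping


values = ["a", "a", "b", "a", "c", "b", "a", "a"]
label_codes, label_mapping = label_encode(values)
freq_codes, freq_mapping = frequency_encode(values)
```

Either way, the column becomes a single numeric dimension instead of one dimension per category, which is where the speedup would come from.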

In the interest of being targeted with our issues & feature requests, I will retitle this to reflect the categorical column handling in CTGAN. We can close it after the SDV update and perform an exploration of what happens if you apply different transformers.

Jul 07 '22 20:07 npatki

As mentioned in the previous issue, I'm closing this out, as it is now possible to assign transformers to CTGAN's data.

However, there is still an issue where CTGAN treats the column as discrete instead of continuous, which is being tracked in #1450.

Jun 01 '23 00:06 npatki