CTGAN icon indicating copy to clipboard operation
CTGAN copied to clipboard

Fix SettingWithCopyWarning (may be leading to a numerical calculation bug)

Open davebulaval opened this issue 2 years ago • 2 comments

When doing the TVAE or CTGAN tutorial, one gets a lot of SettingWithCopyWarning due to the following line.

I found a bug by deeply looking at the code and trying to figure out some things.

I was not quite sure of the objective of doing such a flatten operation on single-line column data, and since it was the line raising the warning, I've started debugging. It has been proven that using the following data[column_name] = value could lead to potential bug.

Using the TVAE tutorial for the student_id.value column, one can obtain the following output table after the _transform_continuous method.

image

By reading the comment in the _transform_continuous method, it is clear that the column 0 must be the exact same value as the transformed vector generated by another package since it is the normalized value.

#  Converts the transformed data to the appropriate output format.
#  The first column (ending in '.normalized') stays the same,
#  but the lable encoded column (ending in '.component') is one hot encoded.

By looking at the idx=16 we can see that the value of column 0 is -0.60451, and according to the transformedoutput shown here, it should be equal to the following: -0.25646.

image

In total, in this example, there is 70 mismatch between the column transformed["student_id.value.normalized"] and the first column of the output (np.equal(transformed["student_id.value.normalized"].to_numpy(), output[:, 0])).

To fix:

Will push a PR but if we switch data[column_name] = data[column_name].to_numpy().flatten() to flatten_data = pd.DataFrame({column_name: data[column_name].to_numpy().flatten()}), it fixes all the mismatch. However, the one-hot encoded columns change compare to the initial code so I am not sure if this is ok.

davebulaval avatar May 05 '22 18:05 davebulaval

Hello, thanks for filing this issue with all the details and screenshots. I just want to acknowledge that we've read through it.

However, the one-hot encoded columns change compare to the initial code so I am not sure if this is ok.

CTGAN has a specific cross-entropy algorithm that requires the categorical columns to be properly one hot encoded and avoid mode collapse.

Let's keep this bug open for tracking. We'd need to handle this with caution and make sure there are no unintended consequences of this change. We can reach out when we have the bandwidth to fully investigate and review such a change.

npatki avatar Jul 14 '22 18:07 npatki

From my experimentation, it seems like it would impact the mode computing.

davebulaval avatar Jul 14 '22 19:07 davebulaval