Fix: "investigate why XGBoost always has high AUCROC for the detection"

Open JimAchterbergLUMC opened this issue 1 year ago • 1 comments

Description

The XGBoost detection metric almost always reports an AUROC close to 1. This is already flagged in a comment in the source code.

The cause, and the resulting fix, are straightforward: the Synthcity library generates float values without considering the precision of the original dataset. Tree-based algorithms like XGBoost exploit this by placing splits in the gaps that the difference in precision creates.

As an example: real data might contain an "Age" feature with values like 55, 61, 72. The generated synthetic data will contain values like 55.563434, 61.382734. XGBoost can then split at thresholds such as 54.99 and 55.01, which isolate the real value 55 from any nearby synthetic values. As a result, the classifier easily distinguishes synthetic from real data, and AUROC is always close to 1.
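A minimal sketch of the leak, using only the standard library. It does not run XGBoost or Synthcity. Instead it scores each value by its distance to the nearest integer, a hand-written stand-in for what a tree can learn with enough splits, and computes AUROC by pairwise comparison. The helper names `auroc` and `fractional_distance` are hypothetical and exist only for this illustration.

```python
def auroc(scores_real, scores_synth):
    """Chance that a random synthetic row outscores a random real row.

    Ties count as half a win, so indistinguishable scores give 0.5.
    """
    wins = 0.0
    for s in scores_synth:
        for r in scores_real:
            if s > r:
                wins += 1.0
            elif s == r:
                wins += 0.5
    return wins / (len(scores_real) * len(scores_synth))


def fractional_distance(value):
    """Distance from value to its nearest integer (0 for whole numbers)."""
    return abs(value - round(value))


# Values taken from the "Age" example above.
real_age = [55, 61, 72]
synthetic_age = [55.563434, 61.382734]

real_scores = [fractional_distance(v) for v in real_age]

# Unrounded synthetic values: precision alone separates the two sets.
leaky_auroc = auroc(real_scores, [fractional_distance(v) for v in synthetic_age])

# Synthetic values rounded to whole numbers: the precision signal is gone.
rounded_age = [float(round(v)) for v in synthetic_age]
rounded_auroc = auroc(real_scores, [fractional_distance(v) for v in rounded_age])

print(leaky_auroc, rounded_auroc)
```

In this toy setup the only signal is precision, so rounding removes it entirely. On a real dataset the detector may still find genuine distributional differences after rounding, which is what the metric is meant to measure.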

How to Reproduce

Go to 'synthcity/metrics/eval_detection.py', line 154, which mentions the issue as a comment: "# TODO: investigate why XGBoost always has high AUCROC for the detection"
You can also reproduce this issue by observing that the detection metric for XGBoost is always close to 1, especially when there are some low-precision numerical features in the dataset (see the 'Age' example above).

Expected Behavior

We can round the synthetic data to the same precision as the real dataset. This needs to happen at least inside the XGBoost detection metric, but it potentially needs to happen across the entire Synthcity library. If we generate synthetic data to replace real data, it should probably have a precision similar to that of the real data.
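One way the rounding could look, as a standard-library sketch rather than a patch against Synthcity. It assumes each numeric column is available as a plain list of numbers, and it infers precision as the largest number of decimal places seen in the real column. The function names `decimal_places`, `column_precision` and `round_like_real` are hypothetical and not part of the library.

```python
import math
from decimal import Decimal


def decimal_places(value):
    """Digits after the decimal point in the shortest repr of value."""
    exponent = Decimal(repr(float(value))).normalize().as_tuple().exponent
    return max(0, -exponent)


def column_precision(real_values):
    """Largest number of decimal places among the finite real values."""
    finite = [v for v in real_values if math.isfinite(v)]
    return max((decimal_places(v) for v in finite), default=0)


def round_like_real(real_values, synthetic_values):
    """Round synthetic values to the precision inferred from the real column.

    Non-finite entries (NaN, inf) are passed through unchanged.
    """
    places = column_precision(real_values)
    return [
        round(v, places) if math.isfinite(v) else v
        for v in synthetic_values
    ]


# "Age" example from the description: real values are whole numbers.
rounded_age = round_like_real([55, 61, 72], [55.563434, 61.382734])
print(rounded_age)
```

Inferring precision from the maximum number of decimal places is a design choice: a single noisy real value such as 55.000001 would raise the precision for the whole column, so a more robust rule (for example, a quantile over the per-value precision) may be preferable in practice.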

Mar 30 '25 16:03 JimAchterbergLUMC

@JimAchterbergLUMC Good catch. This really bothers me too since I found many baselines generate float values even for the age feature.

Nov 13 '25 12:11 yangysc