Fix: "investigate why XGBoost always has high AUCROC for the detection"
Description
The XGBoost detection metric almost always has incredibly high AUROC. This issue is already mentioned as a comment in the source code.
The reason for this, and the subsequent fix, are straightforward: the Synthcity library generates float values without considering the precision of the original dataset. Tree-based algorithms like XGBoost exploit this by placing split thresholds between the values that the lower-precision real data can take.
As an example: real data might contain an "Age" feature with values like 55, 61, 72. The generated synthetic data will be, e.g., 55.563434, 61.382734. XGBoost can then split at thresholds such as 54.99 and 55.01, which isolates the exact integer 55 from any nearby non-integer values. As a result, synthetic and real data are easily distinguished by the classifier, and AUROC is always close to 1.
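To make the mechanism concrete, here is a minimal sketch that uses only the Python standard library. It does not call XGBoost or Synthcity. The data is made up, and `looks_synthetic` is a hypothetical hand-written stand-in for the rule a tree ensemble can approximate with many narrow splits around each integer.

```python
import random

random.seed(0)

# Hypothetical data: real ages are whole numbers, synthetic ages are unrounded floats.
real_age = [float(random.randint(18, 90)) for _ in range(1000)]
synthetic_age = [random.uniform(18, 90) for _ in range(1000)]


def looks_synthetic(value: float, tol: float = 0.01) -> bool:
    """Flag a value as synthetic if it is not within `tol` of a whole number."""
    return abs(value - round(value)) > tol


# Label 0 = real, 1 = synthetic, as in a detection task.
labels = [0] * len(real_age) + [1] * len(synthetic_age)
preds = [int(looks_synthetic(v)) for v in real_age + synthetic_age]

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"accuracy of the precision-only rule: {accuracy:.3f}")
```

The rule never looks at the distribution of ages, only at whether a value is a whole number, yet it separates the two sets almost perfectly. A detector that can learn this rule says little about how realistic the synthetic data is.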
How to Reproduce
- Go to 'synthcity/metrics/eval_detection.py', line 154, which mentions the issue as a comment: "# TODO: investigate why XGBoost always has high AUCROC for the detection"
- You can also reproduce this issue by observing that the detection metric for XGBoost is always close to 1, especially when there are some low-precision numerical features in the dataset (see the 'Age' example above).
Expected Behavior
We can round the synthetic data to the same precision as the real dataset. This needs to happen at least inside the XGBoost detection metric, but it may need to happen across the entire Synthcity library: if we generate synthetic data to replace real data, it should probably have similar precision to the real data.
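A rough sketch of what such rounding could look like with pandas is given below. The helper names `infer_decimals` and `round_like_real` are hypothetical and not part of Synthcity. The `max_decimals` cap and the tolerance used to absorb floating-point noise are assumptions that would need tuning for real datasets.

```python
import numpy as np
import pandas as pd


def infer_decimals(col: pd.Series, max_decimals: int = 6) -> int:
    """Return the smallest number of decimals that reproduces every value in `col`."""
    values = col.dropna().to_numpy(dtype=float)
    tol = 10.0 ** -(max_decimals + 2)  # assumed tolerance for float noise
    for decimals in range(max_decimals + 1):
        if np.allclose(values, np.round(values, decimals), rtol=0, atol=tol):
            return decimals
    return max_decimals


def round_like_real(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Round each numeric synthetic column to the precision seen in the real column."""
    out = synthetic.copy()
    for name in real.select_dtypes(include="number").columns:
        if name in out.columns:
            out[name] = out[name].round(infer_decimals(real[name]))
    return out


# Made-up example data, reusing the "Age" values from the description.
real = pd.DataFrame({"Age": [55, 61, 72], "bmi": [22.5, 30.1, 27.0]})
synthetic = pd.DataFrame({"Age": [55.563434, 61.382734], "bmi": [23.456, 29.871]})
rounded = round_like_real(real, synthetic)
print(rounded)
```

Inferring the precision per column from the real data avoids hard-coding which features are integer-valued. One limitation of this sketch is that it leaves the synthetic column's dtype as float, so a rounded age is stored as 56.0 rather than 56.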
@JimAchterbergLUMC Good catch. This really bothers me too since I found many baselines generate float values even for the age feature.