SynapseML
[BUG] Wasserstein Distance is ill-defined for categorical data and should be excluded
SynapseML version
0.11.1
System information
- Language version: Irrelevant
- Spark Version: Irrelevant
- Spark Platform: Irrelevant
Describe the problem
The Wasserstein distance, sometimes referred to as earth mover's distance (under some conditions), is perhaps most easily understood by imagining distributions as different ways of piling up earth. The earth mover's distance is the minimum cost of transforming one distribution to the other by shuffling earth around, where cost is defined as the amount of dirt moved multiplied by the horizontal distance over which it is moved. In this analogy the effects of gravity are ignored (in reality it takes work to lift dirt up).
Let's take two simple examples. Our goal is to flatten an initial distribution, defined over a metric space spanning from 1 to 3, into a uniform distribution and measure the cost of doing so.
Example 1: Here we have an excess of dirt at 2 which we can move into 3. The cost of flattening the distribution is a single unit.
  o
o o        o o o
o o o      o o o
-----  ->  -----
1 2 3      1 2 3
Example 2: Here the excess of dirt is initially at 1; it must first be moved to 2 and then to 3. The cost of flattening the distribution is 2 units.
o            o
o o        o o        o o o
o o o      o o o      o o o
-----  ->  -----  ->  -----
1 2 3      1 2 3      1 2 3
We can confirm this result with the scipy.stats.wasserstein_distance function:
# Example 1
wasserstein_distance([1,1,2,2,2,3],[1,1,2,2,3,3])
0.166666
# Example 2
wasserstein_distance([1,1,1,2,2,3],[1,1,2,2,3,3])
0.333333
What is important to note is that the Wasserstein distance of Example 2 is twice that of Example 1, just as in our toy calculation.
Now we can see why the Wasserstein distance cannot really be defined for categorical variables. Imagine that instead of the range from 1 to 3, the distribution is over three colors: red, green, and blue. What is the cost of moving earth from green to blue? Is it the same as moving it from red to blue? Will everyone agree on the same answer? It is no longer clear whether Example 2 should cost twice as much as Example 1. We can confirm this issue by computing the Wasserstein distance for Example 1 and Example 2 as categorical columns, as defined by the DistributionBalanceMeasure transformer of SynapseML:
# Example 1
from pyspark.sql.types import IntegerType
from synapse.ml.exploratory import DistributionBalanceMeasure

Example1 = [1,1,2,2,2,3]
df1 = spark.createDataFrame(Example1, IntegerType()).toDF("earth")
wasserstein_distance_example1 = (
    DistributionBalanceMeasure()
    .setSensitiveCols(["earth"])
    .transform(df1)
    .select("FeatureName", "DistributionBalanceMeasure.wasserstein_dist")
)
wasserstein_distance_example1.show(truncate=False)
+-----------+-------------------+
|FeatureName|wasserstein_dist |
+-----------+-------------------+
|earth |0.11111111111111112|
+-----------+-------------------+
# Example 2
Example2 = [1,1,1,2,2,3]
df2 = spark.createDataFrame(Example2, IntegerType()).toDF("earth")
wasserstein_distance_example2 = (
    DistributionBalanceMeasure()
    .setSensitiveCols(["earth"])
    .transform(df2)
    .select("FeatureName", "DistributionBalanceMeasure.wasserstein_dist")
)
wasserstein_distance_example2.show(truncate=False)
+-----------+-------------------+
|FeatureName|wasserstein_dist |
+-----------+-------------------+
|earth |0.11111111111111112|
+-----------+-------------------+
The Wasserstein distance is the same for Example 1 and Example 2, which is not what we expect.
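To make this concrete, here is a small self-contained sketch (the color samples and the encodings enc_a and enc_b are hypothetical choices of mine, not part of SynapseML) showing that the one-dimensional Wasserstein distance changes when the same categories are mapped onto numbers in a different but equally arbitrary order:

```python
def wasserstein_dist(x, y):
    # For equal-size 1-D samples, the Wasserstein distance is the mean
    # absolute difference between the sorted samples.
    sx, sy = sorted(x), sorted(y)
    if len(sx) != len(sy):
        raise ValueError("The two samples must have the same length.")
    return sum(abs(a - b) for a, b in zip(sx, sy)) / len(sx)

x = ["red", "red", "red", "green", "green", "blue"]
y = ["red", "red", "green", "green", "blue", "blue"]

# Two equally arbitrary numeric encodings of the same three colors.
enc_a = {"red": 1, "green": 2, "blue": 3}
enc_b = {"red": 1, "green": 3, "blue": 2}

d_a = wasserstein_dist([enc_a[c] for c in x], [enc_a[c] for c in y])
d_b = wasserstein_dist([enc_b[c] for c in x], [enc_b[c] for c in y])
# d_a != d_b: the "distance" depends on the arbitrary label ordering
```

Because no encoding of colors onto numbers is more correct than another, there is no single well-defined answer for categorical data.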
Code to reproduce issue
Here is another way of showcasing the issue, by coding up a function that can compute the Wasserstein Distance for these simple examples:
def wasserstein_dist(x, y):
    sorted_x = sorted(x)
    sorted_y = sorted(y)
    if len(sorted_x) != len(sorted_y):
        raise ValueError("The two arrays must have the same length.")
    # Use "total" to avoid shadowing the built-in sum
    total = 0
    for i in range(len(sorted_x)):
        total += abs(sorted_x[i] - sorted_y[i])
    return total / len(sorted_x)
Example1 = [1,1,2,2,2,3]
Example2 = [1,1,1,2,2,3]
y = [1,1,2,2,3,3]
wasserstein_dist(Example1, y)
0.16666666
wasserstein_dist(Example2, y)
0.33333333
In contrast, what the DistributionBalanceMeasure transformer of SynapseML is computing is:
from collections import Counter

def categorical_fractions(categorical_data):
    # Count the occurrences of each feature value
    feature_counts = Counter(categorical_data)
    # Calculate the total number of observations
    total_features = len(categorical_data)
    # Calculate the fraction with which each feature value occurs
    fractions = {feature: count / total_features for feature, count in feature_counts.items()}
    return fractions

def cat_wasserstein_dist(x_probs, y_probs):
    # Combine the keys from both dictionaries
    all_keys = set(x_probs.keys()) | set(y_probs.keys())
    # Calculate the absolute differences for each key
    absolute_differences = [abs(x_probs.get(key, 0) - y_probs.get(key, 0)) for key in all_keys]
    # Calculate the mean of the absolute differences
    mean_difference = sum(absolute_differences) / len(absolute_differences)
    return mean_difference
Example1 = [1,1,2,2,2,3]
Example2 = [1,1,1,2,2,3]
y = [1,1,2,2,3,3]
Example1_probs = categorical_fractions(Example1)
Example2_probs = categorical_fractions(Example2)
y_probs = categorical_fractions(y)
cat_wasserstein_dist(Example1_probs, y_probs)
0.11111111
cat_wasserstein_dist(Example2_probs, y_probs)
0.11111111
Note that both definitions contain similar mathematical expressions, especially the component computing mean(abs(X - Y)) (present in the source code here), but they are fundamentally computing different things. In the Wasserstein distance, the difference between X and Y is computed over the metric space (the horizontal direction in the earth-pile picture), while in the definition used by DistributionBalanceMeasure it is computed within each category, which corresponds to the vertical direction in the earth-pile picture. As currently defined, the Wasserstein distance computed by DistributionBalanceMeasure is more akin to the Jensen-Shannon distance.
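Conversely, here is a short sketch (helper names are mine, mirroring the functions above) showing that this per-category mean absolute difference is invariant under an arbitrary relabeling of the categories, which confirms that it ignores the metric structure entirely:

```python
from collections import Counter

def fractions(data):
    # Empirical probability of each category
    n = len(data)
    return {k: v / n for k, v in Counter(data).items()}

def mean_abs_diff(p, q):
    # Mean absolute difference of per-category probabilities
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / len(keys)

x = [1, 1, 1, 2, 2, 3]
y = [1, 1, 2, 2, 3, 3]
relabel = {1: 3, 2: 1, 3: 2}  # an arbitrary permutation of the labels

d1 = mean_abs_diff(fractions(x), fractions(y))
d2 = mean_abs_diff(fractions([relabel[v] for v in x]),
                   fractions([relabel[v] for v in y]))
# d1 == d2: the measure is unchanged by relabeling the categories
```

This invariance is exactly what one wants from a categorical divergence, but it is the opposite of what the Wasserstein distance is supposed to measure.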
Recommendation
My recommendation is to avoid trying to define the Wasserstein distance for categorical features altogether. To properly define it, we require the probability distribution to be defined over a metric space, which categorical variables do not provide. In principle, we could define distances between categories and thus construct a metric analogous to the Wasserstein distance, but this is more trouble than it is worth and would be a source of endless confusion for users. We already have measures that are properly defined for categorical variables, like the Jensen-Shannon distance, so we should use those.
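For illustration, here is a minimal pure-Python sketch of the Jensen-Shannon distance for discrete distributions (a generic textbook definition using natural logarithms; SynapseML's own implementation may differ in log base or normalization):

```python
import math

def js_distance(p, q):
    # Jensen-Shannon distance: square root of the JS divergence between
    # two discrete distributions given as {category: probability} dicts.
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing.
        return sum(a[k] * math.log(a[k] / m[k]) for k in keys if a.get(k, 0.0) > 0)

    return math.sqrt(0.5 * kl(p) + 0.5 * kl(q))

# Well-defined for categories: no ordering or distance between labels is needed.
uniform = {"red": 1/3, "green": 1/3, "blue": 1/3}
skewed = {"red": 1/2, "green": 1/3, "blue": 1/6}
```

Because it only compares per-category probabilities, it needs no metric over the labels, which is exactly why it is appropriate for categorical columns.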
Other info / logs
No response
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [X] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [ ] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [ ] area/models: models related issue
What language(s) does this bug affect?
- [ ] language/scala: Scala source code
- [ ] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [X] integrations/synapse: Azure Synapse integrations
- [X] integrations/azureml: Azure ML integrations
- [X] integrations/databricks: Databricks integrations
Hey @perezbecker :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
@ms-kashyap can you please take a look?
@perezbecker Thanks for filing boss, tagging in creator @ms-kashyap