SynapseML
[BUG] Wasserstein Distance is ill-defined for categorical data and should be excluded
SynapseML version
0.11.1
System information
- Language version: Irrelevant
- Spark Version: Irrelevant
- Spark Platform: Irrelevant
Describe the problem
The Wasserstein distance, sometimes referred to as earth mover's distance (under some conditions), is perhaps most easily understood by imagining distributions as different ways of piling up earth. The earth mover's distance is the minimum cost of transforming one distribution to the other by shuffling earth around, where cost is defined as the amount of dirt moved multiplied by the horizontal distance over which it is moved. In this analogy the effects of gravity are ignored (in reality it takes work to lift dirt up).
Let's take two simple examples. Our goal is to flatten an initial distribution, defined over a metric space spanning from 1 to 3, into a uniform distribution and measure the cost of doing so.
Example 1: Here we have an excess of dirt at 2 which we can move into 3. The cost of flattening the distribution is a single unit.
  o
o o        o o o
o o o      o o o
-----  ->  -----
1 2 3      1 2 3
Example 2: Here the excess of dirt is initially at 1; it must first be moved to 2 and then to 3. The cost of flattening the distribution is 2 units.
o            o
o o        o o        o o o
o o o      o o o      o o o
-----  ->  -----  ->  -----
1 2 3      1 2 3      1 2 3
We can confirm this result with the scipy.stats.wasserstein_distance function:
# Example 1
wasserstein_distance([1,1,2,2,2,3],[1,1,2,2,3,3])
0.166666
# Example 2
wasserstein_distance([1,1,1,2,2,3],[1,1,2,2,3,3])
0.333333
What is important to note is that the Wasserstein distance of Example 2 is twice that of Example 1, just as in our toy calculation.
Now we can see why the Wasserstein distance cannot really be defined for categorical variables. Imagine that instead of the range from 1 to 3, the distribution is over three colors: red, green, and blue. What is the cost of moving earth from green to blue? Is it the same as moving it from red to blue? Will everyone agree on the same answer? It is no longer clear whether Example 2 should cost twice as much as Example 1. We can confirm this issue by computing the Wasserstein distance for Example 1 and Example 2 as categorical columns, as defined by the DistributionBalanceMeasure transformer of SynapseML:
# Example 1
from pyspark.sql.types import IntegerType
from synapse.ml.exploratory import DistributionBalanceMeasure

Example1 = [1,1,2,2,2,3]
df1 = spark.createDataFrame(Example1, IntegerType()).toDF("earth")
wasserstein_distance_example1 = (
    DistributionBalanceMeasure()
    .setSensitiveCols(["earth"])
    .transform(df1)
    .select("FeatureName", "DistributionBalanceMeasure.wasserstein_dist")
)
wasserstein_distance_example1.show(truncate=False)
+-----------+-------------------+
|FeatureName|wasserstein_dist |
+-----------+-------------------+
|earth |0.11111111111111112|
+-----------+-------------------+
# Example 2
Example2 = [1,1,1,2,2,3]
df2 = spark.createDataFrame(Example2, IntegerType()).toDF("earth")
wasserstein_distance_example2 = (
    DistributionBalanceMeasure()
    .setSensitiveCols(["earth"])
    .transform(df2)
    .select("FeatureName", "DistributionBalanceMeasure.wasserstein_dist")
)
wasserstein_distance_example2.show(truncate=False)
+-----------+-------------------+
|FeatureName|wasserstein_dist |
+-----------+-------------------+
|earth |0.11111111111111112|
+-----------+-------------------+
The Wasserstein distance is the same for Example 1 and Example 2, which is not what we expect.
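To make this concrete, here is a small self-contained sketch (the color samples and the encodings enc_a and enc_b are hypothetical choices of mine, not part of SynapseML) showing that the one-dimensional Wasserstein distance changes when the same categories are mapped onto numbers in a different but equally arbitrary order:

```python
def wasserstein_dist(x, y):
    # For equal-size 1-D samples, the Wasserstein distance is the mean
    # absolute difference between the sorted samples.
    sx, sy = sorted(x), sorted(y)
    if len(sx) != len(sy):
        raise ValueError("The two samples must have the same length.")
    return sum(abs(a - b) for a, b in zip(sx, sy)) / len(sx)

x = ["red", "red", "red", "green", "green", "blue"]
y = ["red", "red", "green", "green", "blue", "blue"]

# Two equally arbitrary numeric encodings of the same three colors.
enc_a = {"red": 1, "green": 2, "blue": 3}
enc_b = {"red": 1, "green": 3, "blue": 2}

d_a = wasserstein_dist([enc_a[c] for c in x], [enc_a[c] for c in y])
d_b = wasserstein_dist([enc_b[c] for c in x], [enc_b[c] for c in y])
# d_a != d_b: the "distance" depends on the arbitrary label ordering
```

Because no encoding of colors onto numbers is more correct than another, there is no single well-defined answer for categorical data.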
Code to reproduce issue
Here is another way of showcasing the issue, by coding up a function that can compute the Wasserstein Distance for these simple examples:
def wasserstein_dist(x, y):
    sorted_x = sorted(x)
    sorted_y = sorted(y)
    if len(sorted_x) != len(sorted_y):
        raise ValueError("The two arrays must have the same length.")
    # Use "total" to avoid shadowing the built-in sum
    total = 0
    for i in range(len(sorted_x)):
        total += abs(sorted_x[i] - sorted_y[i])
    return total / len(sorted_x)
Example1 = [1,1,2,2,2,3]
Example2 = [1,1,1,2,2,3]
y = [1,1,2,2,3,3]
wasserstein_dist(Example1, y)
0.16666666
wasserstein_dist(Example2, y)
0.33333333
In contrast, what the DistributionBalanceMeasure transformer of SynapseML is computing is:
from collections import Counter

def categorical_fractions(categorical_data):
    # Count the occurrences of each feature value
    feature_counts = Counter(categorical_data)
    # Calculate the total number of observations
    total_features = len(categorical_data)
    # Calculate the fraction with which each feature value occurs
    fractions = {feature: count / total_features for feature, count in feature_counts.items()}
    return fractions

def cat_wasserstein_dist(x_probs, y_probs):
    # Combine the keys from both dictionaries
    all_keys = set(x_probs.keys()) | set(y_probs.keys())
    # Calculate the absolute differences for each key
    absolute_differences = [abs(x_probs.get(key, 0) - y_probs.get(key, 0)) for key in all_keys]
    # Calculate the mean of the absolute differences
    mean_difference = sum(absolute_differences) / len(absolute_differences)
    return mean_difference
Example1 = [1,1,2,2,2,3]
Example2 = [1,1,1,2,2,3]
y = [1,1,2,2,3,3]
Example1_probs = categorical_fractions(Example1)
Example2_probs = categorical_fractions(Example2)
y_probs = categorical_fractions(y)
cat_wasserstein_dist(Example1_probs, y_probs)
0.11111111
cat_wasserstein_dist(Example2_probs, y_probs)
0.11111111
Note that both definitions contain similar mathematical expressions, especially the component computing mean(abs(X - Y)) (present in the source code here), but they are fundamentally computing different things. In the Wasserstein distance, the difference between X and Y is computed over the metric space (the horizontal direction in the earth-pile picture), while in the definition used by DistributionBalanceMeasure it is computed within each category, which corresponds to the vertical direction in the earth-pile picture. As currently defined, the Wasserstein distance computed by DistributionBalanceMeasure is more akin to the Jensen-Shannon distance.
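Conversely, here is a short sketch (helper names are mine, mirroring the functions above) showing that this per-category mean absolute difference is invariant under an arbitrary relabeling of the categories, which confirms that it ignores the metric structure entirely:

```python
from collections import Counter

def fractions(data):
    # Empirical probability of each category
    n = len(data)
    return {k: v / n for k, v in Counter(data).items()}

def mean_abs_diff(p, q):
    # Mean absolute difference of per-category probabilities
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / len(keys)

x = [1, 1, 1, 2, 2, 3]
y = [1, 1, 2, 2, 3, 3]
relabel = {1: 3, 2: 1, 3: 2}  # an arbitrary permutation of the labels

d1 = mean_abs_diff(fractions(x), fractions(y))
d2 = mean_abs_diff(fractions([relabel[v] for v in x]),
                   fractions([relabel[v] for v in y]))
# d1 == d2: the measure is unchanged by relabeling the categories
```

This invariance is exactly what one wants from a categorical divergence, but it is the opposite of what the Wasserstein distance is supposed to measure.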
Recommendation
My recommendation is to avoid trying to define the Wasserstein distance for categorical features altogether. To properly define it, we require the probability distribution to be defined over a metric space, which categorical variables do not provide. In principle, we could define distances between categories and thus construct a metric analogous to the Wasserstein distance, but this is more trouble than it is worth and would be a source of endless confusion for users. We already have measures that are properly defined for categorical variables, like the Jensen-Shannon distance, so we should use those.
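For illustration, here is a minimal pure-Python sketch of the Jensen-Shannon distance for discrete distributions (a generic textbook definition using natural logarithms; SynapseML's own implementation may differ in log base or normalization):

```python
import math

def js_distance(p, q):
    # Jensen-Shannon distance: square root of the JS divergence between
    # two discrete distributions given as {category: probability} dicts.
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing.
        return sum(a[k] * math.log(a[k] / m[k]) for k in keys if a.get(k, 0.0) > 0)

    return math.sqrt(0.5 * kl(p) + 0.5 * kl(q))

# Well-defined for categories: no ordering or distance between labels is needed.
uniform = {"red": 1/3, "green": 1/3, "blue": 1/3}
skewed = {"red": 1/2, "green": 1/3, "blue": 1/6}
```

Because it only compares per-category probabilities, it needs no metric over the labels, which is exactly why it is appropriate for categorical columns.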
Other info / logs
No response
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [X] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [ ] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [ ] area/models: models related issue
What language(s) does this bug affect?
- [ ] language/scala: Scala source code
- [ ] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [X] integrations/synapse: Azure Synapse integrations
- [X] integrations/azureml: Azure ML integrations
- [X] integrations/databricks: Databricks integrations
Hey @perezbecker :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
@ms-kashyap can you please take a look?
@perezbecker Thanks for filing boss, tagging in creator @ms-kashyap