SynapseML [BUG] Relative entropy should be computed using base 2 logarithm for the Jensen-Shannon distance to be bound between 0 and 1

SynapseML version

0.11.1

System information

Language version: Python 3.8.10
Spark Version: 3.2.2.5.1-96380145
Spark Platform: Azure ML Notebooks with Serverless Spark Compute

Describe the problem

One of the Distribution Balance Measures you use is the Jensen-Shannon Distance, defined in terms of the relative entropy in line 238 of DistributionBalanceMeasure.scala. The relative entropy is defined in line 276 of the same file as:

D = SUM(distA * log(distA/distB)).

With log as the natural logarithm, this formula gives the relative entropy in base e only (see scipy doc). But for the Jensen-Shannon Distance to be bounded between 0 and 1 (as stated in the documentation), the relative entropy needs to be computed using the base 2 logarithm (see Jensen-Shannon Distance wiki page). The definition of relative entropy used for the Jensen-Shannon distance should therefore be the following, with base = 2:

D = SUM(distA * log(distA/distB)) / log(base)

Under the current definition, the theoretical maximum Jensen-Shannon Distance is sqrt(ln(2)) = 0.83255... < 1.
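To make the effect of the log base concrete, here is a minimal pure-Python sketch of the two formulas. It is not the SynapseML Scala code; the function names `relative_entropy` and `js_distance` are made up for illustration. It evaluates the distance for two distributions with no overlap, which is the worst case.

```python
import math


def relative_entropy(p, q, base=math.e):
    """Relative entropy D(p || q), expressed in the given log base."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:  # terms with p_i == 0 contribute 0 by convention
            total += p_i * math.log(p_i / q_i)
    # Dividing by ln(base) converts from nats to the requested base.
    return total / math.log(base)


def js_distance(p, q, base=math.e):
    """Jensen-Shannon distance: square root of the JS divergence."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    divergence = (relative_entropy(p, m, base) + relative_entropy(q, m, base)) / 2
    return math.sqrt(divergence)


# Two distributions with disjoint support (maximal drift).
p = [1.0, 0.0]
q = [0.0, 1.0]

print(js_distance(p, q, base=math.e))  # equals sqrt(ln 2), below 1
print(js_distance(p, q, base=2))       # reaches the upper bound of 1
```

With natural logarithms the worst case tops out at sqrt(ln(2)), while the base 2 version reaches 1.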

Code to reproduce issue

Here is an example of two extremely drifted distributions. Their Jensen-Shannon Distance has already converged to the theoretical maximum value stated above.

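from pyspark.sql.types import StringType
from synapse.ml.exploratory import DistributionBalanceMeasure

imbalanced_color_list = ['red'] * 9999999 + ['blue']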
imbalanced_reference_dist = [{'red':0.0000001, 'blue':0.9999999}]
df_imbalanced = spark.createDataFrame(imbalanced_color_list, StringType()).toDF("color")

distribution_balance_measure_imb = (
    DistributionBalanceMeasure()
    .setSensitiveCols(['color'])
    .setReferenceDistribution(imbalanced_reference_dist)
    .transform(df_imbalanced).select("FeatureName","DistributionBalanceMeasure.js_dist")
)

distribution_balance_measure_imb.show(truncate=False)

+-----------+-----------------+
|FeatureName|js_dist          |
+-----------+-----------------+
|color      |0.832553583110652|
+-----------+-----------------+

We can reproduce this result using the Jensen-Shannon implementation in Scipy:

from scipy.spatial import distance
import numpy as np
import math


def jensen_shannon_distance_categorical(x_list, y_list, base=2):
    # Unique values observed in x and y
    values = set(x_list + y_list)

    # Per-value counts, in the same value order for both lists
    x_counts = np.array([x_list.count(value) for value in values])
    y_counts = np.array([y_list.count(value) for value in values])

    # Optional, since distance.jensenshannon normalizes the probability vectors itself
    x_ratios = x_counts / np.sum(x_counts)
    y_ratios = y_counts / np.sum(y_counts)

    return distance.jensenshannon(x_ratios, y_ratios, base=base)

imbalanced_source = ['red'] * 9999999 + ['blue']
imbalanced_target = ['red']  + ['blue'] * 9999999

jensen_shannon_distance_categorical(imbalanced_source, imbalanced_target, base=math.e)
0.832553583110652

jensen_shannon_distance_categorical(imbalanced_source, imbalanced_target, base=2)
0.999998765189656

When computing the Jensen-Shannon distance using base e logarithms for our example, our result approaches sqrt(ln(2)) = 0.83255..., while when using base 2 logarithms, the result approaches the desired value of 1.
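Since changing the log base only divides the divergence by ln(base), a value already computed with natural logarithms can be converted after the fact by dividing the distance by sqrt(ln(base)). The sketch below shows this as a possible stopgap; `rescale_js_distance` is a hypothetical helper, not part of SynapseML or SciPy.

```python
import math


def rescale_js_distance(js_dist_nats, base=2):
    """Convert a JS distance computed with natural logs to another log base.

    Hypothetical helper for illustration; not a SynapseML or SciPy API.
    """
    # The divergence scales by 1 / ln(base); the distance is its square root,
    # so the distance scales by 1 / sqrt(ln(base)).
    return js_dist_nats / math.sqrt(math.log(base))


# Value reported by DistributionBalanceMeasure in the example above.
print(rescale_js_distance(0.832553583110652))
```

Applied to the js_dist value from the Spark output, this should agree closely with the base 2 SciPy result shown earlier.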

Other info / logs

No response

What component(s) does this bug affect?

[ ] area/cognitive: Cognitive project
[X] area/core: Core project
[ ] area/deep-learning: DeepLearning project
[ ] area/lightgbm: Lightgbm project
[ ] area/opencv: Opencv project
[ ] area/vw: VW project
[ ] area/website: Website
[ ] area/build: Project build system
[ ] area/notebooks: Samples under notebooks folder
[ ] area/docker: Docker usage
[ ] area/models: models related issue

What language(s) does this bug affect?

[X] language/scala: Scala source code
[ ] language/python: Pyspark APIs
[ ] language/r: R APIs
[ ] language/csharp: .NET APIs
[ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

[X] integrations/synapse: Azure Synapse integrations
[X] integrations/azureml: Azure ML integrations
[X] integrations/databricks: Databricks integrations

Jul 06 '23 23:07 perezbecker

Hey @perezbecker :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

Jul 06 '23 23:07 github-actions[bot]

@ms-kashyap can you please take a look?

Jul 10 '23 17:07 memoryz
