SynapseML
[BUG] Relative entropy should be computed using the base 2 logarithm for the Jensen-Shannon distance to be bounded between 0 and 1
SynapseML version
0.11.1
System information
- Language version: Python 3.8.10
- Spark Version: 3.2.2.5.1-96380145
- Spark Platform: Azure ML Notebooks with Serverless Spark Compute
Describe the problem
One of the Distribution Balance Measures you use is the Jensen-Shannon Distance, defined in terms of the relative entropy in line 238 of DistributionBalanceMeasure.scala. The relative entropy is defined in line 276 of the same file as:
D = SUM(distA* log(distA/distB)).
With log taken as the natural logarithm, this formula gives the relative entropy in base e only (see scipy doc). But for the Jensen-Shannon Distance to be bounded between 0 and 1 (as stated in the documentation), the entropy needs to be computed using the base 2 logarithm (see Jensen-Shannon Distance wiki page). The definition of entropy used for the Jensen-Shannon distance thus should be:
D = SUM(distA * log(distA/distB)) / log(base)
with base = 2. Under the current definition, the theoretical maximum Jensen-Shannon Distance is sqrt(ln(2))=0.83255... < 1.
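To make the effect of the logarithm base concrete, here is a minimal pure-Python sketch of the two formulas above. The function names `relative_entropy` and `js_distance` are illustrative only and are not the names used in DistributionBalanceMeasure.scala; the sketch assumes both inputs are probability vectors over the same categories. It evaluates the distance for two distributions with no overlap, which is the case that attains the theoretical maximum.

```python
import math

def relative_entropy(dist_a, dist_b, base=math.e):
    """Relative entropy SUM(a * log(a / b)) / log(base).

    Assumes dist_b is nonzero wherever dist_a is nonzero.
    """
    total = 0.0
    for a, b in zip(dist_a, dist_b):
        if a > 0:  # terms with a == 0 contribute 0 by convention
            total += a * math.log(a / b)
    return total / math.log(base)

def js_distance(dist_a, dist_b, base=math.e):
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    mixture = [(a + b) / 2 for a, b in zip(dist_a, dist_b)]
    divergence = 0.5 * relative_entropy(dist_a, mixture, base) \
               + 0.5 * relative_entropy(dist_b, mixture, base)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding

# Two distributions with no overlap: the maximally distant case.
p = [1.0, 0.0]
q = [0.0, 1.0]

print(js_distance(p, q, base=math.e))  # equals sqrt(ln(2)), below 1
print(js_distance(p, q, base=2))       # reaches the upper bound of 1
```

The only difference between the two calls is the division by log(base), which is the change proposed above.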
Code to reproduce issue
Here is an example of two extremely drifted distributions. Their Jensen-Shannon Distance has already converged to the theoretical maximum value stated above.
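from pyspark.sql.types import StringType
from synapse.ml.exploratory import DistributionBalanceMeasure

# `spark` is the SparkSession provided by the notebook environment.
# Observed data: almost entirely 'red'; reference distribution: almost entirely 'blue'.
imbalanced_color_list = ['red'] * 9999999 + ['blue']
imbalanced_reference_dist = [{'red': 0.0000001, 'blue': 0.9999999}]
df_imbalanced = spark.createDataFrame(imbalanced_color_list, StringType()).toDF("color")
distribution_balance_measure_imb = (
    DistributionBalanceMeasure()
    .setSensitiveCols(['color'])
    .setReferenceDistribution(imbalanced_reference_dist)
    .transform(df_imbalanced)
    .select("FeatureName", "DistributionBalanceMeasure.js_dist")
)
distribution_balance_measure_imb.show(truncate=False)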
+-----------+-----------------+
|FeatureName|js_dist |
+-----------+-----------------+
|color |0.832553583110652|
+-----------+-----------------+
We can reproduce this result using the Jensen-Shannon implementation in Scipy:
from scipy.spatial import distance
import numpy as np
import math
def jensen_shannon_distance_categorical(x_list, y_list, base=2):
    # unique values observed in x and y
    values = set(x_list + y_list)
    x_counts = np.array([x_list.count(value) for value in values])
    y_counts = np.array([y_list.count(value) for value in values])
    x_ratios = x_counts / np.sum(x_counts)  # Optional as JS-D normalizes probability vectors
    y_ratios = y_counts / np.sum(y_counts)
    return distance.jensenshannon(x_ratios, y_ratios, base=base)
imbalanced_source = ['red'] * 9999999 + ['blue']
imbalanced_target = ['red'] + ['blue'] * 9999999
jensen_shannon_distance_categorical(imbalanced_source, imbalanced_target, base=math.e)
0.832553583110652
jensen_shannon_distance_categorical(imbalanced_source, imbalanced_target, base=2)
0.999998765189656
When computing the Jensen-Shannon distance using base e logarithms for our example, our result approaches sqrt(ln(2))=0.83255..., while when using base 2 logarithms, the result approaches the desired value of 1.
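Because changing the base only divides the relative entropy by a constant, a base e distance can be converted to base 2 after the fact by dividing it by sqrt(ln(2)). The sketch below is a possible interim workaround rather than a fix to the library; the helper name `rescale_js_distance_to_base2` is hypothetical, and it assumes the input really was computed with natural logarithms.

```python
import math

def rescale_js_distance_to_base2(js_dist_base_e):
    """Convert a Jensen-Shannon distance computed with natural logs to base 2.

    The divergence scales by 1 / ln(2), so the distance (its square root)
    scales by 1 / sqrt(ln(2)).
    """
    return js_dist_base_e / math.sqrt(math.log(2))

# js_dist value reported by DistributionBalanceMeasure in the example above
reported_js_dist = 0.832553583110652
print(rescale_js_distance_to_base2(reported_js_dist))
```

For the value reported above, the rescaled result should agree closely with the base 2 figure returned by Scipy.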
Other info / logs
No response
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [X] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [ ] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [ ] area/models: models related issue
What language(s) does this bug affect?
- [X] language/scala: Scala source code
- [ ] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [X] integrations/synapse: Azure Synapse integrations
- [X] integrations/azureml: Azure ML integrations
- [X] integrations/databricks: Databricks integrations
Hey @perezbecker :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
@ms-kashyap can you please take a look?