`get_column_plot` produces misleading graphs (for uniform-like distributions)

Open fealho opened this issue 2 years ago • 0 comments

`get_column_plot` produces histograms that take considerable liberties in representing the data, especially at the edges.

The "Real" data series and the matplotlib plot represent the same data (ignore the synthetic data). In the `get_column_plot` graph, the edges always start at 0.5, which can be quite misleading.

Screenshot 2023-02-05 alle 7 40 54 AM

SDV code to generate the above:

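    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    # Import paths below assume SDV 1.x; the original snippet omitted them.
    from sdv.evaluation.single_table import get_column_plot
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    # Uniform real data on [0, 1)
    data = pd.DataFrame({'col1': np.random.random(1000)})
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data)
    synthesizer = GaussianCopulaSynthesizer(metadata)

    # Run and Assert
    synthesizer.fit(data)
    samples = synthesizer.sample(1000)
    print(samples)
    get_column_plot(data, samples, metadata, 'col1').show()

    # Plain matplotlib histogram of the same real data, for comparison
    plt.hist(data['col1'], 50)
    plt.ylabel('some numbers')
    plt.show()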
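
One possible explanation for edges that sit at about half the height of the interior is kernel smoothing. This is an assumption, not something confirmed here: the report does not say how `get_column_plot` draws its curve. The sketch below uses only the standard library to show that a Gaussian kernel density estimate of uniform data drops to roughly half the true density at the boundaries. The helper `gaussian_kde_at`, the bandwidth of 0.05, and the sample size are hypothetical choices made only for illustration.

```python
import math
import random


def gaussian_kde_at(x, samples, bandwidth):
    """Evaluate a Gaussian kernel density estimate at a single point x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [rng.random() for _ in range(20_000)]  # uniform on [0, 1), true density is 1
bandwidth = 0.05  # hypothetical smoothing width, not taken from any library

density_left_edge = gaussian_kde_at(0.0, samples, bandwidth)
density_middle = gaussian_kde_at(0.5, samples, bandwidth)
density_right_edge = gaussian_kde_at(1.0, samples, bandwidth)

# At a boundary, about half of the kernel's mass falls where there is no data,
# so the estimate is close to half of the interior value.
print(f"left edge:  {density_left_edge:.2f}")
print(f"middle:     {density_middle:.2f}")
print(f"right edge: {density_right_edge:.2f}")
```

If smoothing is indeed the cause, a plain histogram with bin edges aligned to the data range (as in the matplotlib call above) would not show the same dip.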

fealho, Feb 05 '23 15:02