[FEAT] Better handling of integers distribution in TableReport

Open Vincent-Maladiere opened this issue 1 year ago • 10 comments

Problem Description

The xtick locations of integer distributions are often off, so the bars are spaced irregularly, which looks visually inconsistent.

Screenshot 2024-11-28 at 12 23 58

The years in the plot above are floats, but converting to integers doesn't help.

Screenshot 2024-11-28 at 12 27 00

Feature Description

We could display the bars at regular intervals for integers (and floats?), especially when the number of bins is < 10. We can come up with a simple heuristic/fix at first.

Alternative Solutions

.

Additional Context

skrub 0.4.0 :))

Vincent-Maladiere avatar Nov 28 '24 11:11 Vincent-Maladiere

thanks @Vincent-Maladiere. to help look for a solution, here is a minimal reproducer of the issue that does not require generating a report:

from matplotlib import pyplot as plt
import numpy as np

x = np.arange(9)
fig, ax = plt.subplots()
ax.hist(x)

histogram

jeromedockes avatar Nov 28 '24 14:11 jeromedockes
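One way to see where the irregular spacing in the reproducer comes from is to look at the bins directly. The sketch below assumes matplotlib's default of 10 bins for `hist` and calls `np.histogram` with that value explicitly; it is only a diagnostic, not a proposed fix.

```python
import numpy as np

x = np.arange(9)  # same data as the reproducer: the integers 0..8

# Assumption: ax.hist(x) uses 10 bins by default, so 9 integers
# are spread over 10 equal-width bins between min(x) and max(x).
counts, edges = np.histogram(x, bins=10)

print(np.diff(edges))  # bin widths are not 1, so edges fall between integers
print(counts)          # one bin receives no value, which shows up as a gap
```

With bins that are narrower than the spacing between the integers, at least one bin stays empty, which would explain the gap in the plot.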

also to try out solutions, could you share the "Year" column you used above?

jeromedockes avatar Nov 28 '24 14:11 jeromedockes

I think maybe when there are few unique values we shouldn't plot a histogram but a stem plot instead: https://matplotlib.org/stable/plot_types/basic/stem.html#sphx-glr-plot-types-basic-stem-py

or treat the variable as categorical and do a bar plot :thinking: if there was some way to detect that the actual values don't matter too much besides their ordering

jeromedockes avatar Nov 28 '24 14:11 jeromedockes
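The stem-plot idea could be sketched roughly as follows. The helper name `plot_few_unique`, the `max_unique` threshold of 10, and the sample years are all made up for illustration; nothing here reflects how TableReport is actually implemented.

```python
import numpy as np
from matplotlib import pyplot as plt


def plot_few_unique(values, ax, max_unique=10):
    """Draw one stem per distinct value; return False if there are too many.

    Hypothetical helper: the threshold is an arbitrary choice.
    """
    uniques, counts = np.unique(values, return_counts=True)
    if len(uniques) > max_unique:
        return False  # caller would fall back to a regular histogram
    ax.stem(uniques, counts)
    ax.set_xticks(uniques)  # put a tick exactly on each observed value
    return True


years = np.array([2019, 2020, 2020, 2021, 2021, 2021, 2023])  # made-up data
fig, ax = plt.subplots()
drawn = plot_few_unique(years, ax)
```

Because the stems sit at the actual values, a missing year (2022 here) appears as an empty position rather than shifting the other bars.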

FWIW, the misalignment between bins and labels is something I've seen in general matplotlib use, so I don't know how it could be addressed specifically in the TableReport

> I think maybe when there are few unique values we shouldn't plot a histogram but a stem plot instead: https://matplotlib.org/stable/plot_types/basic/stem.html#sphx-glr-plot-types-basic-stem-py
>
> or treat the variable as categorical and do a bar plot 🤔 if there was some way to detect that the actual values don't matter too much besides their ordering

I like the idea of using stem plots

rcap107 avatar Nov 28 '24 15:11 rcap107

Maybe we could derive good heuristics using np.histogram and plt.bar instead of plt.hist directly

Vincent-Maladiere avatar Nov 28 '24 15:11 Vincent-Maladiere

sure, I don't think it will make much of a difference -- plt.hist just delegates the binning to np.histogram

jeromedockes avatar Nov 28 '24 16:11 jeromedockes

What I meant is that we might have better control over the bins by decoupling the histogram computation from the bar plot. I don't have anything against stem plots though, as long as they are easy to see on small plots

Vincent-Maladiere avatar Nov 29 '24 08:11 Vincent-Maladiere
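A minimal sketch of that decoupling, assuming integer-valued data with a small range: compute the counts with `np.histogram` using edges at half-integers, then draw them with `ax.bar`. The function name `integer_bar_hist` and the bar width are hypothetical choices, and a wide value range would need a different binning rule.

```python
import numpy as np
from matplotlib import pyplot as plt


def integer_bar_hist(values, ax):
    """Bar plot with one bin per integer (hypothetical helper, small ranges only)."""
    values = np.asarray(values)
    lo, hi = int(values.min()), int(values.max())
    # Edges at k - 0.5 so that every bin is centered on an integer k.
    edges = np.arange(lo, hi + 2) - 0.5
    counts, _ = np.histogram(values, bins=edges)
    centers = np.arange(lo, hi + 1)
    ax.bar(centers, counts, width=0.9)
    return centers, counts


fig, ax = plt.subplots()
centers, counts = integer_bar_hist(np.arange(9), ax)
```

On the reproducer data this gives nine evenly spaced bars, one per integer, because the bin edges no longer fall on or between the values at irregular offsets.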

one question with the stem plots is how to handle outliers -- add a red stem on the side of the axis?

jeromedockes avatar Nov 29 '24 09:11 jeromedockes

btw here's another example in the "day of the week" column in this other issue

jeromedockes avatar Nov 29 '24 12:11 jeromedockes

> one question with the stem plots is how to handle outliers -- add a red stem on the side of the axis?

This is where I prefer bars as well, although a red stem thingy looks fine I guess

Vincent-Maladiere avatar Nov 29 '24 13:11 Vincent-Maladiere