[FEAT] Better handling of integers distribution in TableReport
Problem Description
The xtick locations for integer distributions are often off, so the bars are spaced irregularly and the plot looks inconsistent.
The years in the plot above are floats, but converting to integers doesn't help.
Feature Description
We could space the bars regularly for integers (and floats?), especially when the number of bins is < 10. We can come up with a simple heuristic/fix at first.
Alternative Solutions
.
Additional Context
skrub 0.4.0 :))
thanks @Vincent-Maladiere. To help look for a solution, here is a minimal reproducer of the issue that does not require generating a report:
from matplotlib import pyplot as plt
import numpy as np
x = np.arange(9)
fig, ax = plt.subplots()
ax.hist(x)
also to try out solutions, could you share the "Year" column you used above?
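A possible explanation for the irregular spacing in the reproducer, sketched with NumPy only (no plotting): the default of 10 equal-width bins over the range [0, 8] gives bins about 0.8 wide, so the nine integers cannot fall one per bin and one bin stays empty.

```python
import numpy as np

x = np.arange(9)
# Same default as ax.hist(x): 10 equal-width bins spanning [x.min(), x.max()].
counts, edges = np.histogram(x, bins=10)

print(np.diff(edges))  # each bin is about 0.8 wide, not 1
print(counts)          # the bin covering [3.2, 4.0) receives no value
```

The empty bin is what shows up as a gap between bars, and the 0.8 width is why the bars drift relative to the integer tick labels.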
I think maybe when there are few unique values we shouldn't plot a histogram but a stem plot instead: https://matplotlib.org/stable/plot_types/basic/stem.html#sphx-glr-plot-types-basic-stem-py
or treat the variable as categorical and do a bar plot 🤔 if there was some way to detect that the actual values don't matter too much besides their ordering
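A minimal sketch of that idea, not skrub's implementation: count each distinct value and draw one stem per value at evenly spaced positions, so only the ordering matters. The helper name `plot_few_unique`, the `max_unique=10` threshold, and the sample years are all made up for illustration.

```python
import numpy as np
from matplotlib import pyplot as plt


def plot_few_unique(values, ax, max_unique=10):
    """Hypothetical helper: draw one stem per distinct value.

    Returns False, drawing nothing, when there are more than
    ``max_unique`` distinct values (an arbitrary threshold).
    """
    uniques, counts = np.unique(np.asarray(values), return_counts=True)
    if len(uniques) > max_unique:
        return False
    # Evenly spaced positions: only the ordering of the values is kept.
    positions = np.arange(len(uniques))
    ax.stem(positions, counts)  # ax.bar(positions, counts) would give bars
    ax.set_xticks(positions)
    ax.set_xticklabels([str(u) for u in uniques])
    return True


fig, ax = plt.subplots()
# Made-up float "years", similar in spirit to the column in the report.
drawn = plot_few_unique([2019.0, 2020.0, 2020.0, 2022.0], ax)
```

Note that this treats the values as categories: the gap between 2020 and 2022 is not shown, which may or may not be what we want.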
FWIW, the misalignment between bins and labels is something I've seen in general matplotlib use, so I don't know how it could be addressed specifically in the TableReport
> I think maybe when there are few unique values we shouldn't plot a histogram but a stem plot instead: https://matplotlib.org/stable/plot_types/basic/stem.html#sphx-glr-plot-types-basic-stem-py
> or treat the variable as categorical and do a bar plot 🤔 if there was some way to detect that the actual values don't matter too much besides their ordering
I like the idea of using stem plots
Maybe we could derive good heuristics using np.histogram and plt.bar instead of calling plt.hist directly
sure, I don't think it will make much of a difference -- plt.hist just forwards its binning arguments to np.histogram
What I meant is that we might have better control of the bins by decoupling the histogram computation from the bar plot. I don't have anything against stem plots though, as long as they are easy to see on small plots
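To make the decoupling concrete, here is a rough sketch under the assumption that the column is integer-valued with a small range: compute the counts with np.histogram using edges at k - 0.5, then draw them with ax.bar. The helper name `integer_histogram` is hypothetical, and the check for "small range" is left out.

```python
import numpy as np
from matplotlib import pyplot as plt


def integer_histogram(values, ax):
    """Hypothetical helper: one unit-width bin centered on each integer."""
    values = np.asarray(values)
    low, high = int(values.min()), int(values.max())
    # Edges at k - 0.5, so every integer k sits in the middle of its bin.
    edges = np.arange(low, high + 2) - 0.5
    counts, _ = np.histogram(values, bins=edges)
    centers = np.arange(low, high + 1)
    ax.bar(centers, counts, width=0.9)
    ax.set_xticks(centers)
    return centers, counts


fig, ax = plt.subplots()
# Same data as the minimal reproducer earlier in this thread.
centers, counts = integer_histogram(np.arange(9), ax)
```

With the reproducer data this gives nine bars of height 1, each centered on its tick. For a wide integer range the number of bins would need to be capped, which this sketch does not attempt.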
one question with the stem plots is how to handle outliers -- add a red stem on the side of the axis?
btw here's another example in the "day of the week" column in this other issue
> one question with the stem plots is how to handle outliers -- add a red stem on the side of the axis?
This is where I prefer bars as well, although a red stem thingy looks fine I guess