skrub TableReport outliers improvement?

Describe the bug

I don't think this is a bug; I'm just opening this issue to record a behaviour I just saw.

Here is a series where the outlier detection is slightly off. It is an integer series that only takes the values {0, 80, 100}. The values are effectively categorical: they just distinguish states, and the choice of 0, 80, and 100 is arbitrary (it could have been 0, 1, 2).

Unless we tune the heuristic a little bit to avoid counting outliers when the cardinality is "very low", I don't see an easy improvement.
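To make the idea concrete, here is a minimal sketch of what a low-cardinality guard could look like. This is not skrub's actual implementation: the function name `outlier_bounds`, the `min_unique` threshold, the quantile-based rule, and the sample counts are all assumptions for illustration only.

```python
from statistics import quantiles


def outlier_bounds(values, frac_cuts=5, factor=3.0, min_unique=5):
    """Return (low, high) clipping bounds, or None to skip outlier detection.

    Hypothetical sketch: every parameter and default here is an assumption,
    not something taken from skrub.
    """
    # Proposed guard: with very few distinct values, do not flag outliers.
    if len(set(values)) < min_unique:
        return None
    # Cut points at 20%, 40%, 60%, 80% when frac_cuts == 5.
    cuts = quantiles(values, n=frac_cuts, method="inclusive")
    low_q, high_q = cuts[0], cuts[-1]
    margin = factor * (high_q - low_q)
    return low_q - margin, high_q + margin


# Made-up distribution over the three states (the real counts are in state.csv).
values = [0] * 90 + [80] * 5 + [100] * 5

print(outlier_bounds(values, min_unique=0))  # guard disabled: 80 and 100 fall outside the bounds
print(outlier_bounds(values))                # guard enabled: None, so nothing is flagged
```

With the guard disabled, the bounds collapse around the dominant value, so the two rarer states end up outside them; with the guard enabled, detection is skipped entirely.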

WDYT?

Screenshot 2024-11-29 at 09 42 52

The column dataset:

state.csv

Steps/Code to Reproduce

import pandas as pd
from skrub import TableReport

df = pd.read_csv("state.csv")
TableReport(df)

Expected Results

No outliers

Actual Results

Some outliers

Versions

0.4.0 :)))

Nov 29 '24 08:11 Vincent-Maladiere

Thanks @Vincent-Maladiere. I had considered turning off outlier detection when there are few unique values, but a wide range of values is still a problem even when there are only a few of them. Imagine you had the values 1, 2, 3, and -1000, say, to indicate some invalid value. If you don't cut the axis because of the -1000, then 1, 2, and 3 will all get squished together and you will lose the information.
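A quick back-of-the-envelope check of how much squishing that would be, using the made-up values from the example above (the variable names are only for illustration):

```python
# Values from the example: three regular values plus one sentinel.
values = [1, 2, 3, -1000]
sentinel = -1000

# Axis range if the sentinel is kept on the axis.
full_range = max(values) - min(values)

# Range actually occupied by the regular values.
regular = [v for v in values if v != sentinel]
regular_range = max(regular) - min(regular)

share = regular_range / full_range
print(f"regular values span {share:.2%} of the axis")  # 0.20%
```

So without clipping, 1, 2, and 3 would share roughly two thousandths of the axis width, which is why they would no longer be distinguishable.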

What we would like in this case is to realize that the actual values don't matter and treat the variable as categorical, as you say. But I'm not sure that it's reasonable to assume that is the case whenever there are few unique values 🤔

Nov 29 '24 09:11 jeromedockes

here is another example from the "titanic" dataset

screenshot_2024-11-29T10:43:55+01:00

Nov 29 '24 09:11 jeromedockes

Yes, that's tricky, I agree. The best scenario would be for the user to notice this and convert the column to a string or category dtype. The main difference between my screenshot and yours is that in mine the segment flagged as outliers is bigger than the category displayed on the left, so the outliers are not really outliers, if that makes sense?
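For reference, the user-side conversion could look something like the sketch below. The column name `state` and the inline data are assumptions standing in for state.csv, and I have not checked how the report renders the column afterwards.

```python
import pandas as pd

# Illustrative stand-in for state.csv; the column name "state" is an assumption.
df = pd.DataFrame({"state": [0, 0, 80, 100, 0, 80]})

# Tell pandas the integers are labels rather than quantities.
df["state"] = df["state"].astype("category")

print(df["state"].dtype)                    # category
print(df["state"].cat.categories.tolist())  # [0, 80, 100]
```

Using `astype(str)` instead would be the string-dtype variant of the same idea.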

> What we would like in this case is to realize that the actual values don't matter and treat the variable as categorical, as you say. But I'm not sure that it's reasonable to assume that is the case whenever there are few unique values 🤔

Let's wait a bit to get feedback and decide about this.

Nov 29 '24 16:11 Vincent-Maladiere