soda-sql
soda-sql copied to clipboard
Frequent Values only calculated for numeric types
Seems like Frequent Values Metric is only calculated for numeric types, but this metric can be calculated for (I think) every type of column?
https://github.com/sodadata/soda-sql/blob/b6ce4a11c4eb4ce1a8cd6d4b05bec6e7a8d08424/sodasql/scan/scan.py#L344-L351
@Antoninj I think the solution should be to change the condition
if self.scan_yml.is_metric_enabled(Metric.FREQUENT_VALUES, column_name) \
and (scan_column.is_number or scan_column.is_column_numeric_text_format):
to
if self.scan_yml.is_metric_enabled(Metric.FREQUENT_VALUES, column_name) \
and scan_column.has_numeric_values
While this is always a good improvement to make. It could be that the problem of the numeric string values is still somewhere else. So you have to check the round trip to the UI to see if the frequent value for numeric string columns show up.
While you're at it, it might make sense to rename the property scan_column.has_numeric_values
to scan_column.is_numeric
, to make it consistent with is_number, is_time and is_text
@tombaeyens , I don't think the change you suggest would work since has_numeric_values
(or is_numeric
) is actually the same thing as what is already there, i.e scan_column.is_number or scan_column.is_column_numeric_text_format
.
I think what @dirkgroenen suggests makes sense, i.e to compute the FREQUENT_VALUES
metric on all column types.
The other way around, it looks like we currently compute top 5 min and max values on all columns. Wouldn't it make more sense to only compute those on numeric columns only?
Hmm, I'm getting a little confused here (also caused by myself btw). Looking at the issue I think it originates from requests coming from NNIP asking why they're not seeing as much profiling data on the new environment as they see on the old. Looking at those environment again now makes me wonder if my initial assumption was correct.
Let me try writing down what's going through my head now, hopefully it makes sense...
Initially I was looking at the frontend on both environments and thought to notice a difference between the columns "frequent values" and "extreme values". I think that's why I created this issue, to make sure the "Frequent Values" tab can be enabled.
Now, looking at it for a second time I don't actually think the behavior is that much different. That said I am getting really confused on those two tabs: "Frequent Values" and "Extreme Values".
Looking at the code it looks like "Frequent Values" is actually using the mins
and maxs
. Where as "Extreme Values" is coming from Column.profile.topValues
which seems to be a map of the "Frequent Values" metric in soda-sql?
The main question I've got right now: What actually differs "Frequent Values" from "Extreme Values" in the frontend? @tombaeyens can you maybe shed some light on this?
Frequent values : how often does a certain value occur. All values are ordered by occurrence (aka frequency). The values with the highest frequencies are on top of this list. (aka top values) (@dirkgroenen see my point in our discussion on mappings between soda sql language, backend language and even UI language :) )
Extreme values: highest and lowest values in the column. All values are ordered by their natural ordering. The highest are the maxs, the lowest are the mins and those 2 are referred to in the UI as extreme values (i don't like that name but that is beside the point)
@Antoninj Your observation is correct. The change I propose does logically the same. It was because was a problem is suspected in this area that I started to clean things up there. I still think it's a good improvement to make. But you're right in the sense that it probably doesn't explain the problem we re looking for here.
Who will clear out the naming mistake and figure out if there is still something to be fixed in Soda SQL vs Soda Cloud?
(@dirkgroenen see my point in our discussion on mappings between soda sql language, backend language and even UI language :)
Isn't it ironic? :wink:
Frequent values : how often does a certain value occur. All values are ordered by occurrence (aka frequency). The values with the highest frequencies are on top of this list. (aka top values) (@dirkgroenen see my point in our discussion on mappings between soda sql language, backend language and even UI language :) )
Extreme values: highest and lowest values in the column. All values are ordered by their natural ordering. The highest are the maxs, the lowest are the mins and those 2 are referred to in the UI as extreme values (i don't like that name but that is beside the point)
Interesting. If I look at the current implementation of frequent and extreme values on the frontend side it seems to be exactly the opposite of what you're describing.
Extreme values currently holds the column.profile.mins
and column.profile.maxs
, which translates to the Metric.MINS
and Metric.MAXS
in soda-sql. If I understand you correctly @tombaeyens the extreme values should be the "highest" and the "lowest" values in a column, so that means numeric- and date-like values only, right?
When I look at NNIP's Security Master client_b_id
column I can see that the Extreme Values
is enabled (which means mins
and maxs
metrics was calculated). The result of this is a bit weird though, cause it basically orders the strings ascending and takes the first and last 5 values. If I understand correctly I would expect the Extreme Values
not to have been calculated for this column. Instead I would expect to only see the Frequent Values, which would show me how often certain (duplicate) string-values occur in the column.
Who will clear out the naming mistake and figure out if there is still something to be fixed in Soda SQL vs Soda Cloud?
Summarizing the above I think the following things need to be changed:
- Only calculate 'Extreme Values' (
MINS
andMAXS
metric) for columns withnumeric
ordate
values. - Calculate 'Frequent Values' (
FREQUENT_VALUES
metric) for every column, regardless of its type - Optional: change
Column.profile
to deprecatetopValues
and rename it tofrequentValues
to bring it in line with soda-sql metrics.
@tombaeyens can you verify if my above assumptions and required changes are correct and if so can you, @Antoninj, make the required changes?
I have the same understanding of the current state and required changes as @dirkgroenen. Point 2 has actually already been handled in https://github.com/sodadata/soda-sql/commit/84e6d3254550f581cbd64faede5be7c3c9992998, so I'll take care of point 1 (as long as @tombaeyens agrees 'extreme values' are only relevant for columns where it is possible to define a meaningful natural ordering such as numbers and dates).
Extreme values currently holds the column.profile.mins and column.profile.maxs, which translates to the Metric.MINS and Metric.MAXS in soda-sql.
Yes. Looks good to me.
If I understand you correctly @tombaeyens the extreme values should be the "highest" and the "lowest" values in a column,
Yes
so that means numeric- and date-like values only, right?
I assume numeric, date-like types and string values as well. The min / max (singular) is to be excluded in the profile overview for text values. But the extreme values should also be present for text values imo.
@mathissedestrooper Can you confirm?
When I look at NNIP's Security Master client_b_id column I can see that the Extreme Values is enabled (which means mins and maxs metrics was calculated). The result of this is a bit weird though, cause it basically orders the strings ascending and takes the first and last 5 values.
I don't see what's wrong with that. Also for text like columns you want to see the range like eg
Aleksic
Baeyens
Boutonnet
...
Masschelein
Kiran
Zorro
@tombaeyens agreed:
- MIN/MAX doesn't make sense for TEXT
- Extreme values makes sense (like the example above)