soda-sql icon indicating copy to clipboard operation
soda-sql copied to clipboard

Frequent Values only calculated for numeric types

Open dirkgroenen opened this issue 3 years ago • 10 comments

Seems like Frequent Values Metric is only calculated for numeric types, but this metric can be calculated for (I think) every type of column?

https://github.com/sodadata/soda-sql/blob/b6ce4a11c4eb4ce1a8cd6d4b05bec6e7a8d08424/sodasql/scan/scan.py#L344-L351

dirkgroenen avatar Apr 02 '21 08:04 dirkgroenen

@Antoninj I think the solution should be to change the condition

if self.scan_yml.is_metric_enabled(Metric.FREQUENT_VALUES, column_name) \ 
         and (scan_column.is_number or scan_column.is_column_numeric_text_format): 

to

if self.scan_yml.is_metric_enabled(Metric.FREQUENT_VALUES, column_name) \ 
         and scan_column.has_numeric_values

While this is always a good improvement to make. It could be that the problem of the numeric string values is still somewhere else. So you have to check the round trip to the UI to see if the frequent value for numeric string columns show up.

While you're at it, it might make sense to rename the property scan_column.has_numeric_values to scan_column.is_numeric, to make it consistent with is_number, is_time and is_text

tombaeyens avatar Apr 09 '21 07:04 tombaeyens

@tombaeyens , I don't think the change you suggest would work since has_numeric_values (or is_numeric) is actually the same thing as what is already there, i.e scan_column.is_number or scan_column.is_column_numeric_text_format.

I think what @dirkgroenen suggests makes sense, i.e to compute the FREQUENT_VALUES metric on all column types.

Antoninj avatar Apr 09 '21 08:04 Antoninj

The other way around, it looks like we currently compute top 5 min and max values on all columns. Wouldn't it make more sense to only compute those on numeric columns only?

Antoninj avatar Apr 09 '21 09:04 Antoninj

Hmm, I'm getting a little confused here (also caused by myself btw). Looking at the issue I think it originates from requests coming from NNIP asking why they're not seeing as much profiling data on the new environment as they see on the old. Looking at those environment again now makes me wonder if my initial assumption was correct.

Let me try writing down what's going through my head now, hopefully it makes sense...

Initially I was looking at the frontend on both environments and thought to notice a difference between the columns "frequent values" and "extreme values". I think that's why I created this issue, to make sure the "Frequent Values" tab can be enabled.

Now, looking at it for a second time I don't actually think the behavior is that much different. That said I am getting really confused on those two tabs: "Frequent Values" and "Extreme Values".

Looking at the code it looks like "Frequent Values" is actually using the mins and maxs. Where as "Extreme Values" is coming from Column.profile.topValues which seems to be a map of the "Frequent Values" metric in soda-sql?

The main question I've got right now: What actually differs "Frequent Values" from "Extreme Values" in the frontend? @tombaeyens can you maybe shed some light on this?

dirkgroenen avatar Apr 09 '21 10:04 dirkgroenen

Frequent values : how often does a certain value occur. All values are ordered by occurrence (aka frequency). The values with the highest frequencies are on top of this list. (aka top values) (@dirkgroenen see my point in our discussion on mappings between soda sql language, backend language and even UI language :) )

Extreme values: highest and lowest values in the column. All values are ordered by their natural ordering. The highest are the maxs, the lowest are the mins and those 2 are referred to in the UI as extreme values (i don't like that name but that is beside the point)

@Antoninj Your observation is correct. The change I propose does logically the same. It was because was a problem is suspected in this area that I started to clean things up there. I still think it's a good improvement to make. But you're right in the sense that it probably doesn't explain the problem we re looking for here.

Who will clear out the naming mistake and figure out if there is still something to be fixed in Soda SQL vs Soda Cloud?

tombaeyens avatar Apr 09 '21 11:04 tombaeyens

(@dirkgroenen see my point in our discussion on mappings between soda sql language, backend language and even UI language :)

Isn't it ironic? :wink:

Frequent values : how often does a certain value occur. All values are ordered by occurrence (aka frequency). The values with the highest frequencies are on top of this list. (aka top values) (@dirkgroenen see my point in our discussion on mappings between soda sql language, backend language and even UI language :) )

Extreme values: highest and lowest values in the column. All values are ordered by their natural ordering. The highest are the maxs, the lowest are the mins and those 2 are referred to in the UI as extreme values (i don't like that name but that is beside the point)

Interesting. If I look at the current implementation of frequent and extreme values on the frontend side it seems to be exactly the opposite of what you're describing.

Extreme values currently holds the column.profile.mins and column.profile.maxs, which translates to the Metric.MINS and Metric.MAXS in soda-sql. If I understand you correctly @tombaeyens the extreme values should be the "highest" and the "lowest" values in a column, so that means numeric- and date-like values only, right?

When I look at NNIP's Security Master client_b_id column I can see that the Extreme Values is enabled (which means mins and maxs metrics was calculated). The result of this is a bit weird though, cause it basically orders the strings ascending and takes the first and last 5 values. If I understand correctly I would expect the Extreme Values not to have been calculated for this column. Instead I would expect to only see the Frequent Values, which would show me how often certain (duplicate) string-values occur in the column.

Who will clear out the naming mistake and figure out if there is still something to be fixed in Soda SQL vs Soda Cloud?

Summarizing the above I think the following things need to be changed:

  1. Only calculate 'Extreme Values' (MINS and MAXS metric) for columns with numeric or date values.
  2. Calculate 'Frequent Values' (FREQUENT_VALUES metric) for every column, regardless of its type
  3. Optional: change Column.profile to deprecate topValues and rename it to frequentValues to bring it in line with soda-sql metrics.

@tombaeyens can you verify if my above assumptions and required changes are correct and if so can you, @Antoninj, make the required changes?

dirkgroenen avatar Apr 09 '21 13:04 dirkgroenen

I have the same understanding of the current state and required changes as @dirkgroenen. Point 2 has actually already been handled in https://github.com/sodadata/soda-sql/commit/84e6d3254550f581cbd64faede5be7c3c9992998, so I'll take care of point 1 (as long as @tombaeyens agrees 'extreme values' are only relevant for columns where it is possible to define a meaningful natural ordering such as numbers and dates).

Antoninj avatar Apr 09 '21 13:04 Antoninj

Extreme values currently holds the column.profile.mins and column.profile.maxs, which translates to the Metric.MINS and Metric.MAXS in soda-sql.

Yes. Looks good to me.

If I understand you correctly @tombaeyens the extreme values should be the "highest" and the "lowest" values in a column,

Yes

so that means numeric- and date-like values only, right?

I assume numeric, date-like types and string values as well. The min / max (singular) is to be excluded in the profile overview for text values. But the extreme values should also be present for text values imo.

@mathissedestrooper Can you confirm?

tombaeyens avatar Apr 16 '21 10:04 tombaeyens

When I look at NNIP's Security Master client_b_id column I can see that the Extreme Values is enabled (which means mins and maxs metrics was calculated). The result of this is a bit weird though, cause it basically orders the strings ascending and takes the first and last 5 values.

I don't see what's wrong with that. Also for text like columns you want to see the range like eg

Aleksic
Baeyens
Boutonnet
...
Masschelein
Kiran
Zorro

tombaeyens avatar Apr 16 '21 10:04 tombaeyens

@tombaeyens agreed:

  • MIN/MAX doesn't make sense for TEXT
  • Extreme values makes sense (like the example above)

mathissedestrooper avatar Apr 16 '21 14:04 mathissedestrooper