miller icon indicating copy to clipboard operation
miller copied to clipboard

Summing up empty data

Open Poshi opened this issue 2 years ago • 8 comments

I found a small lack of documentation and an unexpected result while adding data.

I'm using an input in homogeneous CSV format, so all data is either empty or present and defined, but never absent. It could be something like:

x,y,z
3,6,
2,,7
2,,
1,5,8

I'm trying to perform the addition of these three fields, expecting mlr to interpret the empty fields as zeroes, like if they were absent, to minimize keystrokes (following the policy of simplifying the life to the user with non-present data). These are the results:

$ mlr --csv --from input.csv put '$sum = $x + $y + $z'
x,y,z,sum
3,6,,
2,,7,
2,,,
1,5,8,14

Only when all the data is non_empty the addition is performed. From the documentation I can see that this is expected. The reasoning behind is that conceptually, absent and empty are different things and have different meanings, but when you are in a homogeneous CSV context where absents are not possible, we hit this issue. In my specific case, the columns are file sizes, and the empty ones are for files that does not exist (so they are absent), but the only possible representation is be leaving the blank in the file. Well, until this point it is just food for though (decisions and implementation has already been done so...).

But I also identified a point in the documentation that can be easily expanded and will help clarify this point: I saw that in reference-main-null-data help page, the table at the end (arithmetic rules) does not contain an entry for empty data, only for absents and errors. If the behavior must be different from absent, it should be included here too.

Poshi avatar Mar 23 '22 12:03 Poshi

@Poshi this absolutely sounds like a miss, perhaps in the Go port.

I absolutely treated empties as zeroes, documented that, regression-tested for it.

Let me see what I missed in the port.

johnkerl avatar Mar 23 '22 22:03 johnkerl

@Poshi I stand corrected. Miller 5 -> 6 (C -> Go) didn't change behavior here. What I was recalling above as being "absolutely" true was the behavior for stats1 -- which was (and is) inconsistent not from Miller 5 to Miller 6, but rather, between stats1 and +.

$ mlr5 --c2p --from input.csv put '$sum = $x + $y + $z'
x y z sum
3 6 - -
2 - 7 -
2 - - -
1 5 8 14

$ mlr --c2p --from input.csv put '$sum = $x + $y + $z'
x y z sum
3 6 - -
2 - 7 -
2 - - -
1 5 8 14

$ mlr5 --c2p --from input.csv stats1 -a sum -f x,y,z
x_sum y_sum z_sum
8     11    15

$ mlr --c2p --from input.csv stats1 -a sum -f x,y,z
x_sum y_sum z_sum
8     11    15

We really should have similar semantics between stats1 and +, and I'm amazed (and saddened) I didn't notice this before.

This would be a slightly backward-incompatible change although (for strict semantic versioning) I would hesitate to call this Miller 7.0.0 -- we can put this in Miller 6.3.0.

johnkerl avatar Mar 24 '22 02:03 johnkerl

Note mlr step behaves like mlr stats1:

$ mlr --c2p --from input.csv step -a rsum -f x,y,z
x y z x_rsum y_rsum z_rsum
3 6 - 3      6      -
2 - 7 5      -      7
2 - - 7      -      -
1 5 8 8      11     15

johnkerl avatar Mar 24 '22 11:03 johnkerl

Design:

https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling

In that wording (which goes back a long time) I clearly intended empty plus number is number. So the code (C and Go) and the docs are consistent. The only thing that's inconsistent (and very much so) is how this compares columnwise summations like stats1 or step.

Workaround:

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'
x y z sum
3 6 - 9
2 - 7 9
2 - - 2
1 5 8 14

johnkerl avatar Mar 24 '22 11:03 johnkerl

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'

@johnkerl what can I read to understand the meaning of ???0

Thank you

aborruso avatar Mar 24 '22 11:03 aborruso

Design:

https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling

In that wording (which goes back a long time) I clearly intended empty plus number is number. So the code (C and Go) and the docs are consistent. The only thing that's inconsistent (and very much so) is how this compares columnwise summations like stats1 or step.

I disagree. In that page, you literally say:

Empty values are explicit in the data so they should explicitly affect accumulations: mlr put '@sum += $x' should accumulate numeric x values into the sum but an empty x, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the is_not_null function as follows: mlr put 'is_not_null($x) { @sum += $x }'

"an empty x should make the sum non-numeric"... is just opposite of your words in this message "empty plus number is number".

The point is that it make sense that an empty value should affect operations and an absent not. But in homogeneous CSV files, you cannot distinguish empty from absent :-(

Poshi avatar Mar 24 '22 12:03 Poshi

Thanks @Poshi ! You're right, you found another inconsistency.

To be clear though -- I very much think this change needs to be made; I'm in agreement. Thanks for helping me correctly catalog various things written in various places over the years -- helping to find the doc bits (as well as the code, obviously) that all need to be changed to make this happen.

johnkerl avatar Mar 24 '22 13:03 johnkerl

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'

@johnkerl what can I read to understand the meaning of ???0

Thank you

@aborruso https://miller.readthedocs.io/en/latest/reference-dsl-builtin-functions/index.html#_30

johnkerl avatar Mar 27 '22 15:03 johnkerl