miller
miller copied to clipboard
Summing up empty data
I found a small lack of documentation and an unexpected result while adding data.
I'm using an input in homogeneous CSV format, so all data is either empty or present and defined, but never absent. It could be something like:
x,y,z
3,6,
2,,7
2,,
1,5,8
I'm trying to perform the addition of these three fields, expecting mlr
to interpret the empty fields as zeroes, like if they were absent, to minimize keystrokes (following the policy of simplifying the life to the user with non-present data). These are the results:
$ mlr --csv --from input.csv put '$sum = $x + $y + $z'
x,y,z,sum
3,6,,
2,,7,
2,,,
1,5,8,14
Only when all the data is non_empty the addition is performed. From the documentation I can see that this is expected. The reasoning behind is that conceptually, absent and empty are different things and have different meanings, but when you are in a homogeneous CSV context where absents are not possible, we hit this issue. In my specific case, the columns are file sizes, and the empty ones are for files that does not exist (so they are absent), but the only possible representation is be leaving the blank in the file. Well, until this point it is just food for though (decisions and implementation has already been done so...).
But I also identified a point in the documentation that can be easily expanded and will help clarify this point:
I saw that in reference-main-null-data
help page, the table at the end (arithmetic rules) does not contain an entry for empty
data, only for absents and errors. If the behavior must be different from absent, it should be included here too.
@Poshi this absolutely sounds like a miss, perhaps in the Go port.
I absolutely treated empties as zeroes, documented that, regression-tested for it.
Let me see what I missed in the port.
@Poshi I stand corrected. Miller 5 -> 6 (C -> Go) didn't change behavior here. What I was recalling above as being "absolutely" true was the behavior for stats1
-- which was (and is) inconsistent not from Miller 5 to Miller 6, but rather, between stats1
and +
.
$ mlr5 --c2p --from input.csv put '$sum = $x + $y + $z'
x y z sum
3 6 - -
2 - 7 -
2 - - -
1 5 8 14
$ mlr --c2p --from input.csv put '$sum = $x + $y + $z'
x y z sum
3 6 - -
2 - 7 -
2 - - -
1 5 8 14
$ mlr5 --c2p --from input.csv stats1 -a sum -f x,y,z
x_sum y_sum z_sum
8 11 15
$ mlr --c2p --from input.csv stats1 -a sum -f x,y,z
x_sum y_sum z_sum
8 11 15
We really should have similar semantics between stats1
and +
, and I'm amazed (and saddened) I didn't notice this before.
This would be a slightly backward-incompatible change although (for strict semantic versioning) I would hesitate to call this Miller 7.0.0 -- we can put this in Miller 6.3.0.
Note mlr step
behaves like mlr stats1
:
$ mlr --c2p --from input.csv step -a rsum -f x,y,z
x y z x_rsum y_rsum z_rsum
3 6 - 3 6 -
2 - 7 5 - 7
2 - - 7 - -
1 5 8 8 11 15
Design:
https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling
In that wording (which goes back a long time) I clearly intended empty plus number is number. So the code (C and Go) and the docs are consistent. The only thing that's inconsistent (and very much so) is how this compares columnwise summations like stats1
or step
.
Workaround:
$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'
x y z sum
3 6 - 9
2 - 7 9
2 - - 2
1 5 8 14
$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'
@johnkerl what can I read to understand the meaning of ???0
Thank you
Design:
https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling
In that wording (which goes back a long time) I clearly intended empty plus number is number. So the code (C and Go) and the docs are consistent. The only thing that's inconsistent (and very much so) is how this compares columnwise summations like
stats1
orstep
.
I disagree. In that page, you literally say:
Empty values are explicit in the data so they should explicitly affect accumulations: mlr put '@sum += $x' should accumulate numeric x values into the sum but an empty x, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the is_not_null function as follows: mlr put 'is_not_null($x) { @sum += $x }'
"an empty x should make the sum non-numeric"... is just opposite of your words in this message "empty plus number is number".
The point is that it make sense that an empty value should affect operations and an absent not. But in homogeneous CSV files, you cannot distinguish empty from absent :-(
Thanks @Poshi ! You're right, you found another inconsistency.
To be clear though -- I very much think this change needs to be made; I'm in agreement. Thanks for helping me correctly catalog various things written in various places over the years -- helping to find the doc bits (as well as the code, obviously) that all need to be changed to make this happen.
$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'
@johnkerl what can I read to understand the meaning of
???0
Thank you
@aborruso https://miller.readthedocs.io/en/latest/reference-dsl-builtin-functions/index.html#_30