visidata
visidata copied to clipboard
[numeric-binning] Null values cause stack trace or incorrect counts
Small description Frequency Sheet has issues with numeric-binning and null values.
Expected result Null values have their own row with a count.
Actual result with screenshot
https://asciinema.org/a/Mit8iWUp7IJktIUdgh8O5VpyG
Includes steps with and without numeric_binning turned on. The error is seen with numeric_binning
Steps to reproduce with sample data and a .vd
vd --numeric_binning=True sample_data/y77d-th95.json.gz
On the geolocation column run the z#
type-len
command.
Then hit the F
command to get a frequency sheet.
The count is wrong, it should have 988 for 2, and 12 for 0. But we see 1000 for 2.
If we limit the rows such that we only look at some rows with null for geolocation and hit the "F" key we get this error:
Stack Trace:
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/visidata/threads.py", line 206, in _toplevelTryFunc
t.status = func(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/visidata/pivot.py", line 178, in groupRows
minval = min(vals)
ValueError: min() arg is an empty sequence
Bonus Bug/Stack Trace Easter Egg bug found as part of trying to reproduce:
echo '{"a":[1]}' | vd -f json
Stack Trace:
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/visidata/column.py", line 343, in getCell
dw.display = self.format(typedval) or ''
File "/usr/lib/python3.9/site-packages/visidata/column.py", line 215, in format
return self._formatMaker(self._formatdict)(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/visidata/column.py", line 224, in formatValue
return f'[{len(typedval)}] ' + '; '.join(typedval[0:10])
TypeError: sequence item 0: expected str instance, int found
Additional context Please include the version of VisiData. Latest on develop.
Good catch, @frosencrantz. The numeric binning code is thorny, so this may not get fixed right away, but it's definitely a bug.
A start:
groupRows
(line 163), is collecting all the vals with vals = tuple(numericCols[0].getValues(self.source.rows))
.
getValues
calls getValueRows
(line 11 in aggregators.py), which by design is excluding rows with null
:
11 def getValueRows(self, rows):
12 'Generate (value, row) for each row in *rows* at this column, excluding null and error values.'
13 f = self.sheet.isNullFunc()
14
15 for r in Progress(rows, 'calculating'):
16 try:
17 v = self.getTypedValue(r)
18 if not f(v):
19 yield v, r
This behaviour makes sense for getValueRows
because it is used by functions that are not expecting null
or error
values.
e.g.
vd.aggregators['keymax'] = _defaggr('keymax', anytype, lambda col, rows: col.sheet.rowkey(max(col.getValueRows(rows))[1]), 'key of the maximum value')
This confused me because we have https://github.com/saulpw/visidata/blob/develop/tests/freq-error.vd which shows the binning of errors and nulls
. But maybe this was never supported for numeric columns.
Edit We support the binning of error
and null
for categorical binning! The bug (and missing support) is specific to numeric binning. Of course.
z#
also does not have a 0
value it seems. #
and %
with numeric binning handle this situation a bit better, but still not ideally.
For this issue, we want to expand numericBinning
to have one row for each error type, and one row for nulls.
Okay @frosencrantz, this should be fixed now! Let us know if numerical binning seems to have other issues.