visidata icon indicating copy to clipboard operation
visidata copied to clipboard

[numeric-binning] Null values cause stack trace or incorrect counts

Open frosencrantz opened this issue 2 years ago • 4 comments

Small description Frequency Sheet has issues with numeric-binning and null values.

Expected result Null values have their own row with a count.

Actual result with screenshot

https://asciinema.org/a/Mit8iWUp7IJktIUdgh8O5VpyG

Includes steps with and without numeric_binning turned on. The error is seen with numeric_binning

Steps to reproduce with sample data and a .vd

vd --numeric_binning=True sample_data/y77d-th95.json.gz

On the geolocation column run the z# type-len command. Then hit the F command to get a frequency sheet. The count is wrong, it should have 988 for 2, and 12 for 0. But we see 1000 for 2.

If we limit the rows such that we only look at some rows with null for geolocation and hit the "F" key we get this error:

Stack Trace:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/visidata/threads.py", line 206, in _toplevelTryFunc
    t.status = func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/visidata/pivot.py", line 178, in groupRows
    minval = min(vals)
ValueError: min() arg is an empty sequence

Bonus Bug/Stack Trace Easter Egg bug found as part of trying to reproduce:

echo '{"a":[1]}' | vd -f json

Stack Trace:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/visidata/column.py", line 343, in getCell
    dw.display = self.format(typedval) or ''
  File "/usr/lib/python3.9/site-packages/visidata/column.py", line 215, in format
    return self._formatMaker(self._formatdict)(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/visidata/column.py", line 224, in formatValue
    return f'[{len(typedval)}] ' + '; '.join(typedval[0:10])
TypeError: sequence item 0: expected str instance, int found

Additional context Please include the version of VisiData. Latest on develop.

frosencrantz avatar Jun 26 '22 18:06 frosencrantz

Good catch, @frosencrantz. The numeric binning code is thorny, so this may not get fixed right away, but it's definitely a bug.

saulpw avatar Jul 05 '22 05:07 saulpw

A start:

groupRows (line 163), is collecting all the vals with vals = tuple(numericCols[0].getValues(self.source.rows)).

getValues calls getValueRows (line 11 in aggregators.py), which by design is excluding rows with null:

 11 def getValueRows(self, rows):                                                                                                                  
 12     'Generate (value, row) for each row in *rows* at this column, excluding null and error values.'                                            
 13     f = self.sheet.isNullFunc()                                                                                                                
 14                                                                                                                                                
 15     for r in Progress(rows, 'calculating'):                                                                                                    
 16         try:                                                                                                                                   
 17             v = self.getTypedValue(r)                                                                                                          
 18             if not f(v):                                                                                                                       
 19                 yield v, r   

This behaviour makes sense for getValueRows because it is used by functions that are not expecting null or error values.

e.g.

 vd.aggregators['keymax'] = _defaggr('keymax', anytype, lambda col, rows: col.sheet.rowkey(max(col.getValueRows(rows))[1]), 'key of the maximum     value')     

anjakefala avatar Jul 06 '22 04:07 anjakefala

This confused me because we have https://github.com/saulpw/visidata/blob/develop/tests/freq-error.vd which shows the binning of errors and nulls. But maybe this was never supported for numeric columns.

Edit We support the binning of error and null for categorical binning! The bug (and missing support) is specific to numeric binning. Of course.

z# also does not have a 0 value it seems. # and % with numeric binning handle this situation a bit better, but still not ideally.

anjakefala avatar Jul 06 '22 05:07 anjakefala

For this issue, we want to expand numericBinning to have one row for each error type, and one row for nulls.

anjakefala avatar Jul 06 '22 05:07 anjakefala

Okay @frosencrantz, this should be fixed now! Let us know if numerical binning seems to have other issues.

saulpw avatar Oct 18 '23 00:10 saulpw