circlecore icon indicating copy to clipboard operation
circlecore copied to clipboard

fix: return entire array if no outliers are found

Open gengotech opened this issue 1 year ago • 9 comments

fix in the filter_outliers method, with previous behavior if no outliers were found, an empty array would have been returned. this results in adjusted ur calculation being None in some replays (while normal ur calculation would be fine)

this conflicts with the usage and documentation of the method, which states that it'd return the array with the outliers removed.

to fix this, we simply return the original array if no outliers were found.

gengotech avatar Feb 16 '24 06:02 gengotech

am I misreading or doesn't this return the entire array if everything is an outlier? if nothing if an outlier, then this function still returns the entire array.

I think it's actually somewhat valid to return an invalid adjusted ur when everything is an outlier (because it's very unreliable if everything is an outlier), but open to opinions.

tybug avatar Feb 16 '24 18:02 tybug

if everything is an outlier?

i don't see how that's possible with how interquartile range was implemented, could you provide an example?

gengotech avatar Feb 16 '24 20:02 gengotech

I mean, I'm just reading the code as written:

    arr_without_outliers = [x for x in arr if lower_limit < x < upper_limit]
    return arr if not arr_without_outliers else arr_without_outliers

this returns the original array arr if arr_without_outliers is empty (i.e. the array was empty to begin with, or everything in the array is an outlier)

tybug avatar Feb 16 '24 20:02 tybug

put another way: this function in master already does what this pr says, which is return the entire array when there are no outliers. so something is being lost in translation here

tybug avatar Feb 16 '24 20:02 tybug

skipping semantics, as it is in master an .ur(adjusted=True) calculation will return None if there are no outliers

if that's an intended outcome then you can close the pr, per personal preference i need it to return the same value as adjusted=False instead of None and merged downstream incase it wasnt intended (wasnt clear to me)

gengotech avatar Feb 16 '24 20:02 gengotech

skipping semantics, as it is in master an .ur(adjusted=True) calculation will return None if there are no outliers

do you have an example? I can't reproduce this. If this is the case, that's definitely a bug, like you mentioned.

tybug avatar Feb 16 '24 20:02 tybug

do you have an example? I can't reproduce this. If this is the case, that's definitely a bug, like you mentioned.

it usually happens at very low cvUR, one example would be this replay

it has lower limit of 2 and upper limit of 2 so [x for x in arr if lower_limit < x < upper_limit] would return an empty array (and thus UR calc would return None)

another option could be including the boundary with lower_limit <= x <= upper_limit

image

gengotech avatar Feb 16 '24 20:02 gengotech

That's a case where everything is an outlier, not nothing.

fwiw, I get nan and an error for normal and adjusted ur respectively for that replay:

r = cg.ReplayMap(1711983, 33839341)
print(cg.ur(r)) # nan
print(cg.ur(r, adjusted=True)) # error
...
  File "/opt/homebrew/lib/python3.12/site-packages/numpy/lib/function_base.py", line 4830, in _quantile
    slices_having_nans = np.isnan(arr[-1, ...])
                                  ~~~^^^^^^^^^
IndexError: index -1 is out of bounds for axis 0 with size 0

I'm fine with including the endpoints for outlier filtering to avoid this in the case of 0 iqr. We also probably want to raise an error when calculating the ur of a replay with < 3 hits (I thought we already did, to be honest).

tybug avatar Feb 16 '24 20:02 tybug

fwiw, I get nan and an error for normal and adjusted ur respectively for that replay:

Huh, weird. I ran your script and could not reproduce the results on circleguard 5.4.1, I get an UR calculation of 4.705853985027059 and the adjusted UR d oes return nan with the following error

venv/lib/python3.11/site-packages/numpy/core/_methods.py:206: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
venv/lib/python3.11/site-packages/numpy/core/_methods.py:163: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean,
venv/lib/python3.11/site-packages/numpy/core/_methods.py:198: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
nan

gengotech avatar Feb 17 '24 11:02 gengotech