statistical reducers output mismatch

Open pradkrish opened this issue 4 years ago • 5 comments

The output for statistical reducers when used in the square brackets syntax does not match that of Frame methods.

from datatable import dt, f

srcs = ["a", "bc", "def", None, -2.5, 3.7] / dt.obj64
df = dt.Frame(srcs)
RES1 = df.min()
RES2 = df[:, dt.min(f[:])]  # What is the expected output here?

RES1 and RES2 should match, but RES2 instead raises "TypeError: Unable to apply reduce function min() to a column of type obj64". The same mismatch is seen for max and mean, and likely for other reducers too. As asked in the comment on the RES2 line, what is the expected output there?

pradkrish avatar Jun 26 '21 15:06 pradkrish

This is by design. The stats functions such as mean() or max(), when called as methods of a Frame, apply to each column of that frame and produce a single-row frame with one column per input column. If the stat function is not applicable to a particular column, there will be an NA in that place in the result.

The regular stat functions such as dt.mean() or dt.max() behave differently: they can apply to one or more columns given as the argument, but if one of those columns has an incorrect type, an exception will be raised. This is better from the user's perspective: if a column has an unexpected type, it's better to catch that early.

st-pasha avatar Jun 26 '21 20:06 st-pasha
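
The two policies described above can be modelled in plain Python. This is only a sketch of the behaviour, not datatable's implementation: the dict-of-lists stand-in for a frame and the helper names lenient_min and strict_min are assumptions made for illustration, and None plays the role of NA.

```python
from numbers import Real


def _is_numeric_column(values):
    """True if every non-missing value in the column is a real number."""
    return all(isinstance(v, Real) for v in values if v is not None)


def _min_or_none(values):
    """Minimum of the non-missing values, or None if there are none."""
    present = [v for v in values if v is not None]
    return min(present) if present else None


def lenient_min(frame):
    """Model of the Frame-method policy: unsupported columns yield None (NA)."""
    return {
        name: _min_or_none(col) if _is_numeric_column(col) else None
        for name, col in frame.items()
    }


def strict_min(frame):
    """Model of the dt.min()-style policy: an unsupported column raises."""
    result = {}
    for name, col in frame.items():
        if not _is_numeric_column(col):
            raise TypeError(f"cannot apply min() to column {name!r}")
        result[name] = _min_or_none(col)
    return result


# Hypothetical data: column A is numeric, column B mixes strings and numbers.
frame = {"A": [3, 1, None, 2], "B": ["a", "bc", None, -2.5]}
print(lenient_min(frame))  # B is reported as None instead of raising
try:
    strict_min(frame)
except TypeError as exc:
    print("strict_min raised:", exc)
```

In this model the lenient variant hides a column of unexpected type behind an NA, while the strict variant surfaces it immediately, which is the trade-off argued for above.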

In that case, is it okay for the outputs of min and other reducers, calculated in these two different ways, not to match when the columns are of type obj64?

pradkrish avatar Jun 26 '21 20:06 pradkrish

@st-pasha Frankly, I don't understand why the behavior should be different. A Frame can also consist of one or more columns, and a new frame can easily be constructed that contains only the columns with valid types...

Btw, the output in the first case looks strange:

>>> RES1 = df.min()
>>> RES1
   | C0       
   | obj64    
-- + ---------
 0 | <unknown>
[1 row x 1 column]

From what you're saying, I would expect that to be NA.

oleksiyskononenko avatar Jun 26 '21 22:06 oleksiyskononenko
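
The suggestion above, that a frame containing only the columns with valid types could be built before reducing, can be sketched in plain Python. This is a model under stated assumptions rather than datatable code: the frame is a dict of lists, None stands for NA, and select_numeric and column_min are hypothetical helper names.

```python
from numbers import Real


def select_numeric(frame):
    """Keep only columns whose non-missing values are all real numbers."""
    return {
        name: col
        for name, col in frame.items()
        if all(isinstance(v, Real) for v in col if v is not None)
    }


def column_min(col):
    """Minimum of the non-missing values, or None if the column has none."""
    present = [v for v in col if v is not None]
    return min(present) if present else None


# Hypothetical data: B is a string column and is dropped before reducing.
frame = {"A": [3, 1, None], "B": ["a", "bc", None], "C": [2.5, -2.5, 3.7]}
numeric = select_numeric(frame)
mins = {name: column_min(col) for name, col in numeric.items()}
print(mins)
```

With this approach the result simply has fewer columns than the input, instead of containing an NA or raising an exception.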

Yeah, this is simply because our handling of Obj columns is really bad:

>>> dt.Frame(range(5), type=object)
   | C0       
   | obj64    
-- + ---------
 0 | <unknown>
 1 | <unknown>
 2 | <unknown>
 3 | <unknown>
 4 | <unknown>
[5 rows x 1 column]

We simply don't have any better way of rendering these obj values.

Ideally, they should be repr-d here, but we don't have that logic implemented yet.

st-pasha avatar Jun 28 '21 08:06 st-pasha
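
A minimal sketch of what repr-based rendering of an object column could look like, written in plain Python. It is not datatable's display code: the function name render_obj_column is hypothetical, the layout is only loosely modelled on the frame output shown above, and None is rendered as NA by assumption.

```python
def render_obj_column(name, values):
    """Render one object column as a text table, using repr() for each cell."""
    cells = ["NA" if v is None else repr(v) for v in values]
    # Column width: the widest of the header and all rendered cells.
    width = max([len(name)] + [len(c) for c in cells])
    # Width of the row-index gutter, at least one character.
    idx_width = len(str(max(len(cells) - 1, 0)))

    lines = [
        f"{'':>{idx_width}} | {name:<{width}}",
        f"{'-' * idx_width} + {'-' * width}",
    ]
    for i, cell in enumerate(cells):
        lines.append(f"{i:>{idx_width}} | {cell:<{width}}")
    # Trailing padding is stripped so the output has no dangling spaces.
    return "\n".join(line.rstrip() for line in lines)


print(render_obj_column("C0", ["a", None, -2.5, [1, 2]]))
```

Using repr() rather than str() keeps strings visibly quoted, so the string 'a' and an arbitrary object whose text happens to be a are not confused in the output.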

Can you open an issue for this, in which we can discuss what the outputs for obj64 columns should look like? Once that is resolved, we can take up how obj64 columns should be handled by reducers used within square brackets. Is it okay to defer this until then?

pradkrish avatar Jun 28 '21 14:06 pradkrish