datatable
statistical reducers output mismatch
The output for statistical reducers when used in the square brackets syntax does not match that of Frame methods.
import datatable as dt
from datatable import f
srcs = ["a", "bc", "def", None, -2.5, 3.7] / dt.obj64
df = dt.Frame(srcs)
RES1 = df.min()
RES2 = df[:, dt.min(f[:])] # What is the expected output here?
RES1 and RES2 should match, but RES2 instead raises TypeError: Unable to apply reduce function min() to a column of type obj64. The same mismatch is seen for max and mean, and likely for other reducers too. As asked in the code comment above, what is the expected output there?
This is by design. The stats functions such as mean() or max(), when called as methods of a Frame, apply to each column of that frame and produce a frame of shape 1 x ncols (one row, one value per column). If the stat function is not applicable to a particular column, there will be an NA in that place in the result.
The regular stat functions such as dt.mean() or dt.max() behave differently: they can apply to one or more columns given as the argument, but if one of those columns has an incorrect type, an exception will be raised. This is better from the user's perspective: if a column has an unexpected type, it's better to catch that early.
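To make the distinction concrete, here is a minimal pure-Python sketch of the two policies described above. It is not datatable's implementation: the helpers `is_applicable`, `lenient_min` and `strict_min` are hypothetical names, columns are modeled as plain lists, and the sketch assumes for illustration that min() is only applicable to numeric columns.

```python
def is_applicable(col):
    """Assumption for this sketch: min() applies only to numeric columns."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in col
        if v is not None
    )


def column_min(col):
    """Smallest non-missing value of a column, or None (NA) if there is none."""
    values = [v for v in col if v is not None]
    return min(values) if values else None


def lenient_min(columns):
    """Frame-method style: one result per column, NA where min() does not apply."""
    return [column_min(col) if is_applicable(col) else None for col in columns]


def strict_min(columns):
    """Square-bracket style: raise as soon as one column has an unsuitable type."""
    for i, col in enumerate(columns):
        if not is_applicable(col):
            raise TypeError(f"Unable to apply min() to column {i}")
    return [column_min(col) for col in columns]


mixed = ["a", "bc", "def", None, -2.5, 3.7]  # stands in for an obj64 column
nums = [3, 1, None, 2]

print(lenient_min([mixed, nums]))  # NA for the mixed column, a value for nums
try:
    strict_min([mixed, nums])
except TypeError as exc:
    print("strict policy raised:", exc)
```

The lenient variant never fails but can silently hide a column of the wrong type, while the strict variant surfaces the problem at the call site, which is the trade-off being argued here.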
In that case, is it okay for the outputs of min and other reducers, computed in these two different ways, not to match when the columns are of type obj64?
@st-pasha Frankly, I don't understand why the behavior should be different. A Frame can also consist of one or more columns, and a new frame can easily be constructed that contains only the columns with valid types...
Btw, output in the first case looks strange:
>>> RES1 = df.min()
>>> RES1
   | C0
   | obj64
-- + ---------
 0 | <unknown>
[1 row x 1 column]
From what you're saying, I would expect that to be NA.
Yeah, this is simply because our handling of obj64 columns is really bad:
>>> dt.Frame(range(5), type=object)
   | C0
   | obj64
-- + ---------
 0 | <unknown>
 1 | <unknown>
 2 | <unknown>
 3 | <unknown>
 4 | <unknown>
[5 rows x 1 column]
We simply don't have any better way of rendering these obj values.
Ideally, they should be repr-d here, but we don't have that logic implemented yet.
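As a rough illustration of what repr-based rendering could look like, here is a small pure-Python sketch. It is not part of datatable: `render_obj_cell` and its `max_width` parameter are hypothetical, and the choices to show missing values as NA and to truncate long reprs are assumptions made only for this example.

```python
def render_obj_cell(value, max_width=20):
    """Render one object cell via repr(), truncating overly long output."""
    if value is None:
        return "NA"  # assumption: missing values are displayed as NA
    text = repr(value)
    if len(text) > max_width:
        # Keep the cell within max_width characters, marking the cut with "..."
        text = text[: max_width - 3] + "..."
    return text


# Render the same values as in the frame above, plus a string and a missing value
for v in [*range(5), "abc", None]:
    print(render_obj_cell(v))
```

Truncation keeps a single large object from blowing up the table width, though the right limit would depend on how the rest of the frame is laid out.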
Can you open an issue for this where we can discuss what the output for obj64 columns should look like? Once that is resolved, we can take up how obj64 columns should be handled by reducers used within square brackets. Is it okay to defer this until then?