datatable
datatable copied to clipboard
Refactor Expr architecture
Our current approach for creating new Exprs is overly complicated, as outlined here: https://github.com/h2oai/datatable/blob/master/src/core/expr/!readme.md. This complexity stems mostly from the fact that the Expr class which performs arithmetic on f-expressions is defined in pure python, and then needs to be bridged into the C++ core.
A more sane approach would be to define everything in C++, eliminating most of the "middle-man" code. In particular, the following architecture is proposed:
- C++ class
py::FExprto replace current pythonExprclass; - C++ class
py::ColumnNamespaceto replace current pythonFrameProxyclass; - C++ class
dt::expr::FExpris a merged version of currentdt::expr::Expranddt::expr::Head. The class is virtual, with the hierarchy following that of theHeadclass; - Each
py::FExprcontains ashared_ptr<dt::expr::FExpr>; - The
dt::expr::FExprclass defines virtual methods for evaluation and reproing; - The
Openum is removed.
subtasks
- [x] Add support for numeric and comparison methods in
py::XObject<C>; - [x] Create class
py::FExpr(which will eventually replace the pure-pythondatatable.expr.Expr); - [x] Create class
dt::expr::FExprwhich is a backend forpy::FExpr; - [x] Arrange so that new
FExprs can be used alongside old pure-pythonExprs; - [x] Create class
py::Namespaceto replace pure-pythondatatable.expr.FrameProxy; - [ ] Convert existing
OldExpr-based functionality into FExprs:- [x] Frame-expr;
- [x] List-expr;
- [x] Dict-expr;
- [x] Literal exprs:
- [x] None;
- [x] bool;
- [x] int;
- [x] float;
- [x] str;
- [x] type;
- [x] range;
- [x] slice (all);
- [x] slice (numeric);
- [x] slice (string);
- [x] Column selectors
f.A/f[0]; - [x]
f.extend(); - [x]
f.remove(); - [x] Cast functions;
- [ ]
shift(); - [x]
ifelse(); - [x]
cut(); - [x]
qcut(); - [x] Arithmetic binary operators
- [x]
+; - [x]
-; - [x]
*; - [x]
/; - [x]
//; - [x]
%; - [x]
**;
- [x]
- [ ] Bitwise binary operators
- [ ]
&; - [ ]
|; - [ ]
^; - [ ]
<<; - [ ]
>>;
- [ ]
- [ ] Unary operations
- [ ]
+; - [ ]
-; - [ ]
~;
- [ ]
- [x] Comparison operators
- [x]
<; - [x]
>; - [x]
<=; - [x]
>=; - [x]
==; - [x]
!=;
- [x]
- [x] String methods
- [x]
len() - [x]
re_match();
- [x]
- [ ] Reducers
- [x]
mean, - [x]
min, - [x]
max, - [ ]
stdev, - [ ]
first, - [ ]
last, - [x]
sum, - [x]
count, - [x]
count0, - [ ]
median, - [ ]
cov, - [ ]
corr;
- [x]
- [ ] Math functions:
- [ ] Trigonometric
- [ ]
sin, - [ ]
cos, - [ ]
tan, - [ ]
arcsin, - [ ]
arccos, - [ ]
arctan, - [ ]
arctan2, - [ ]
hypot, - [ ]
deg2rad, - [ ]
rad2deg;
- [ ]
- [ ] Hyperbolic
- [ ]
sinh, - [ ]
cosh, - [ ]
tanh, - [ ]
arsinh, - [ ]
arcosh, - [ ]
arcosh;
- [ ]
- [ ] Exponential
- [ ]
cbrt, - [ ]
exp, - [ ]
exp2, - [ ]
expm1, - [ ]
log, - [ ]
log10, - [ ]
log1p, - [ ]
log2, - [ ]
logaddexp, - [ ]
logaddexp2, - [ ]
pow, - [ ]
sqrt, - [ ]
square;
- [ ]
- [ ] Special
- [ ]
erf, - [ ]
erfc, - [ ]
gamma, - [ ]
lgamma;
- [ ]
- [ ] Floating
- [ ]
abs, - [ ]
ceil, - [ ]
copysign, - [ ]
fabs, - [ ]
floor, - [ ]
frexp, - [ ]
isclose, - [ ]
isfinite, - [ ]
isinf, - [ ]
isna, - [ ]
ldexp, - [ ]
modf, - [ ]
rint, - [ ]
sign, - [ ]
signbit, - [ ]
trunc;
- [ ]
- [ ] Miscellaneous
- [ ]
clip, - [ ]
divmod, - [ ]
fmod, - [ ]
maximum, - [ ]
minimum;
- [ ]
- [ ] Trigonometric
- [x] Row-functions:
- [x]
rowall, - [x]
rowany, - [x]
rowcount, - [x]
rowfirst, - [x]
rowlast, - [x]
rowmin, - [x]
rowmax, - [x]
rowmean, - [x]
rowsum, - [x]
rowsd;
- [x]
- [x] Documentation:
- [x] Update documentation on how to work with new FExpr infrastructure ("expr/!readme.md");
- [x] Add API documentation for the
py::Namespaceclass; - [x] Add API documentation for the
py::FExprclass;
- [ ] Final cleanup:
- [x] Remove python class
datatable.expr.FrameProxy; - [ ] Remove python class
datatable.expr.Expr; - [ ] Remove python enum
datatable.expr.OpCodes; - [ ] Remove the
dt::expr::Openum; - [ ] Remove the
dt::expr::OldExprclass; - [ ] Remove
args_registry.
- [x] Remove python class
Hi @st-pasha, I think it would be helpful as well if a documentation could be created for code contributors , to know what the building blocks are, so to speak. That way when dabbling in and writing the code, I can sort of know which parts to use and what. Admittedly, my knowledge of C++ is limited, so maybe the documentation is clear and the challenge is with me.
Also, with this new refactoring, is it advisable for code contributors to wait, till the refactoring is complete?
Hi @samukweku, this new refactoring effort aims at simplifying code inside the core/expr folder. It is quite a significant part of the codebase, but it's not the only part. So if you want to try and contribute code, feel free to look around. For example, #2166 is both relatively simple and completely independent of the Expr architecture. On the other hand if you wanted to do any PR that involves creating new functions that act within DT[...], then it's best to wait a week or so until this upgrade is complete.
Hi @st-pasha, is it possible to include case-insensitive string contains functionality in the new Expo architecture?
Sure, everything is possible, though making it properly handle international text would be very tricky... However, we can start with an ASCII version first.
Do you have a specific API in mind for this function (i.e. function name, parameters, etc)?
Hi @st-pasha, I don't have a specific API in mind, although I would be looking forward to using this with the upcoming case_when functionality.