datatable icon indicating copy to clipboard operation
datatable copied to clipboard

Refactor Expr architecture

Open st-pasha opened this issue 5 years ago • 5 comments

Our current approach for creating new Exprs is overly complicated, as outlined here: https://github.com/h2oai/datatable/blob/master/src/core/expr/!readme.md. This complexity stems mostly from the fact that the Expr class which performs arithmetic on f-expressions is defined in pure python, and then needs to be bridged into the C++ core.

A more sane approach would be to define everything in C++, eliminating most of the "middle-man" code. In particular, the following architecture is proposed:

  • C++ class py::FExpr to replace current python Expr class;
  • C++ class py::ColumnNamespace to replace current python FrameProxy class;
  • C++ class dt::expr::FExpr is a merged version of current dt::expr::Expr and dt::expr::Head. The class is virtual, with the hierarchy following that of the Head class;
  • Each py::FExpr contains a shared_ptr<dt::expr::FExpr>;
  • The dt::expr::FExpr class defines virtual methods for evaluation and reproing;
  • The Op enum is removed.

subtasks

  • [x] Add support for numeric and comparison methods in py::XObject<C>;
  • [x] Create class py::FExpr (which will eventually replace the pure-python datatable.expr.Expr);
  • [x] Create class dt::expr::FExpr which is a backend for py::FExpr;
  • [x] Arrange so that new FExprs can be used alongside old pure-python Exprs;
  • [x] Create class py::Namespace to replace pure-python datatable.expr.FrameProxy;
  • [ ] Convert existing OldExpr-based functionality into FExprs:
    • [x] Frame-expr;
    • [x] List-expr;
    • [x] Dict-expr;
    • [x] Literal exprs:
      • [x] None;
      • [x] bool;
      • [x] int;
      • [x] float;
      • [x] str;
      • [x] type;
      • [x] range;
      • [x] slice (all);
      • [x] slice (numeric);
      • [x] slice (string);
    • [x] Column selectors f.A / f[0];
    • [x] f.extend();
    • [x] f.remove();
    • [x] Cast functions;
    • [ ] shift();
    • [x] ifelse();
    • [x] cut();
    • [x] qcut();
    • [x] Arithmetic binary operators
      • [x] +;
      • [x] -;
      • [x] *;
      • [x] /;
      • [x] //;
      • [x] %;
      • [x] **;
    • [ ] Bitwise binary operators
      • [ ] &;
      • [ ] |;
      • [ ] ^;
      • [ ] <<;
      • [ ] >>;
    • [ ] Unary operations
      • [ ] +;
      • [ ] -;
      • [ ] ~;
    • [x] Comparison operators
      • [x] <;
      • [x] >;
      • [x] <=;
      • [x] >=;
      • [x] ==;
      • [x] !=;
    • [x] String methods
      • [x] len()
      • [x] re_match();
    • [ ] Reducers
      • [x] mean,
      • [x] min,
      • [x] max,
      • [ ] stdev,
      • [ ] first,
      • [ ] last,
      • [x] sum,
      • [x] count,
      • [x] count0,
      • [ ] median,
      • [ ] cov,
      • [ ] corr;
    • [ ] Math functions:
      • [ ] Trigonometric
        • [ ] sin,
        • [ ] cos,
        • [ ] tan,
        • [ ] arcsin,
        • [ ] arccos,
        • [ ] arctan,
        • [ ] arctan2,
        • [ ] hypot,
        • [ ] deg2rad,
        • [ ] rad2deg;
      • [ ] Hyperbolic
        • [ ] sinh,
        • [ ] cosh,
        • [ ] tanh,
        • [ ] arsinh,
        • [ ] arcosh,
        • [ ] arcosh;
      • [ ] Exponential
        • [ ] cbrt,
        • [ ] exp,
        • [ ] exp2,
        • [ ] expm1,
        • [ ] log,
        • [ ] log10,
        • [ ] log1p,
        • [ ] log2,
        • [ ] logaddexp,
        • [ ] logaddexp2,
        • [ ] pow,
        • [ ] sqrt,
        • [ ] square;
      • [ ] Special
        • [ ] erf,
        • [ ] erfc,
        • [ ] gamma,
        • [ ] lgamma;
      • [ ] Floating
        • [ ] abs,
        • [ ] ceil,
        • [ ] copysign,
        • [ ] fabs,
        • [ ] floor,
        • [ ] frexp,
        • [ ] isclose,
        • [ ] isfinite,
        • [ ] isinf,
        • [ ] isna,
        • [ ] ldexp,
        • [ ] modf,
        • [ ] rint,
        • [ ] sign,
        • [ ] signbit,
        • [ ] trunc;
      • [ ] Miscellaneous
        • [ ] clip,
        • [ ] divmod,
        • [ ] fmod,
        • [ ] maximum,
        • [ ] minimum;
    • [x] Row-functions:
      • [x] rowall,
      • [x] rowany,
      • [x] rowcount,
      • [x] rowfirst,
      • [x] rowlast,
      • [x] rowmin,
      • [x] rowmax,
      • [x] rowmean,
      • [x] rowsum,
      • [x] rowsd;
  • [x] Documentation:
    • [x] Update documentation on how to work with new FExpr infrastructure ("expr/!readme.md");
    • [x] Add API documentation for the py::Namespace class;
    • [x] Add API documentation for the py::FExpr class;
  • [ ] Final cleanup:
    • [x] Remove python class datatable.expr.FrameProxy;
    • [ ] Remove python class datatable.expr.Expr;
    • [ ] Remove python enum datatable.expr.OpCodes;
    • [ ] Remove the dt::expr::Op enum;
    • [ ] Remove the dt::expr::OldExpr class;
    • [ ] Remove args_registry.

st-pasha avatar Aug 06 '20 20:08 st-pasha

Hi @st-pasha, I think it would be helpful as well if a documentation could be created for code contributors , to know what the building blocks are, so to speak. That way when dabbling in and writing the code, I can sort of know which parts to use and what. Admittedly, my knowledge of C++ is limited, so maybe the documentation is clear and the challenge is with me.

Also, with this new refactoring, is it advisable for code contributors to wait, till the refactoring is complete?

samukweku avatar Aug 08 '20 08:08 samukweku

Hi @samukweku, this new refactoring effort aims at simplifying code inside the core/expr folder. It is quite a significant part of the codebase, but it's not the only part. So if you want to try and contribute code, feel free to look around. For example, #2166 is both relatively simple and completely independent of the Expr architecture. On the other hand if you wanted to do any PR that involves creating new functions that act within DT[...], then it's best to wait a week or so until this upgrade is complete.

st-pasha avatar Aug 08 '20 08:08 st-pasha

Hi @st-pasha, is it possible to include case-insensitive string contains functionality in the new Expo architecture?

Peter-Pasta avatar Oct 12 '20 08:10 Peter-Pasta

Sure, everything is possible, though making it properly handle international text would be very tricky... However, we can start with an ASCII version first.

Do you have a specific API in mind for this function (i.e. function name, parameters, etc)?

st-pasha avatar Oct 12 '20 16:10 st-pasha

Hi @st-pasha, I don't have a specific API in mind, although I would be looking forward to using this with the upcoming case_when functionality.

Peter-Pasta avatar Oct 12 '20 20:10 Peter-Pasta