Proposal to evaluate expression columns based on dependency ordering

Open ArashPartow opened this issue 10 months ago • 2 comments

Currently, columns that have associated expressions are evaluated in the order they reside within t_config's m_expressions member. The ordering in that vector is derived from t_view_config's m_expressions vector, and so on. In short, the order is arbitrary, presumably the order of the columns from left to right.

I would like to propose that, during an update, if one or more expressions depend on another column that is itself an expression, the column being depended upon should be evaluated before any of the columns that depend on it.

As an example, in the following video taxes are computed in one column. We may want to add another column, the discount-tax, which applies a discount to the computed tax. The discount is based on the sales amount, state (or city) and month (as some states discount taxes temporarily for a given month).

In this situation, the second expression would always need to be evaluated after the tax column has been computed.

var discount_factor_perc :=
    "Sales" >= 10^7 ?
       switch
       {
          case "State" == 'California' and "Monthdate" == '2023/12' : 10; // 10% only during Dec 2023
          case "City"  == 'Yonkers'                                 : 90; // 90% always - can't refuse
          default                                                   :  0; //  0%
       } : 0;

tax * (1 - discount_factor_perc / 100);

Description: Where Sales are at least $10mil, the discount is 10% if the state is California and the month is December 2023, and 90% if the city is Yonkers. Otherwise there is no tax discount.

  • Why not just do the discount factor calc in the tax expression? Because we would like to see both values: the tax amount and the discounted tax amount

  • Why not simply copy-n-paste the tax calculation into the discount factor calc? Because of separation of concerns, and also because the actual tax computation may not be as trivial as the one shown in the demo. I've already got 99 problems; remembering to update at least one other column whenever the tax computation changes would be just one more.


The solution to this would be to build a DAG covering the expressions and then to topologically order the expressions in m_expressions.

Expression dependencies can be determined by using ExprTK's dependent_entity_collector, which provides the information needed for DAG construction and cycle detection. More details can be found in Section 16 of the readme: https://www.partow.net/programming/exprtk/readme.html#line_2951
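
To make the idea concrete, here is a minimal sketch of the ordering step using Kahn's algorithm. It is not Perspective's code: the `dependency_map` type and the `evaluation_order` function are hypothetical names, and the map is assumed to have already been filled from whatever the collector reports for each expression. Names that are not themselves expression columns are treated as plain table columns and ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical input: expression column name -> names of the columns it reads.
using dependency_map = std::map<std::string, std::set<std::string>>;

// Returns the expression columns ordered so that every dependency precedes
// its dependents, or std::nullopt if the expressions contain a cycle.
std::optional<std::vector<std::string>>
evaluation_order(const dependency_map& deps)
{
    std::map<std::string, std::size_t> pending;             // unresolved deps per column
    std::map<std::string, std::vector<std::string>> users;  // column -> its dependents

    for (const auto& [column, reads] : deps)
    {
        pending[column];  // make sure every expression column has an entry
        for (const auto& dep : reads)
        {
            // Plain table columns are always available, so they add no edge.
            if (deps.count(dep) == 0) continue;
            ++pending[column];
            users[dep].push_back(column);
        }
    }

    std::queue<std::string> ready;
    for (const auto& [column, count] : pending)
        if (count == 0) ready.push(column);

    std::vector<std::string> order;
    while (!ready.empty())
    {
        const std::string column = ready.front();
        ready.pop();
        order.push_back(column);
        for (const auto& user : users[column])
        {
            assert(pending[user] > 0);
            if (--pending[user] == 0) ready.push(user);
        }
    }

    // Columns on a cycle never reach a pending count of zero.
    if (order.size() != deps.size()) return std::nullopt;
    return order;
}
```

With the tax example, a map of `{"discount_tax": {"tax", "Sales", "State"}, "tax": {"Sales"}}` would yield `tax` before `discount_tax`, while two expressions that read each other would be reported as a cycle rather than ordered.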

ArashPartow, Mar 16 '25 23:03

Thanks @ArashPartow , I think this is a good idea and I'm in support. Some general thoughts:

  • We've introduced some functions (#2374) which have dynamic column dependencies. We could maybe special-case expressions that use these functions to just implicitly depend on "all" columns that would not create cycles (and maybe also add error handling in the ExprTK functions?). We'd also like to greatly expand on these functions, e.g. vlookup() to another Table, that may be impacted (or supported) by general dependency resolution.

  • An old (pre-ExprTK) version of Perspective supported a simpler version of this: expressions in the expression array could refer to expressions that had been defined earlier in the array, but not later. However, we removed it when porting to ExprTK, mainly due to the resulting UX complexity: validating and recovering from reference cycles, dealing with expressions which can't be deleted because other expressions depend on them, etc. I'm not certain if this is as much of an issue today, as the <perspective-viewer> expression editor has some error reporting and auto completion, but we'd need to work this out concurrently with enabling the feature in-engine.

  • Expression columns are currently calculated in parallel (in Python and Rust). This change would introduce some pipelining between dependency groups that is worth understanding and benchmarking.
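
As a rough illustration of what "dependency groups" could look like, the sketch below splits expression columns into stages. Columns within a stage do not read each other and so could still be evaluated in parallel, while each stage has to wait for the previous one. The `dependency_map` type and `evaluation_stages` function are hypothetical and not part of Perspective or ExprTK; the repeated scan is quadratic and chosen for brevity, not speed.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical input: expression column name -> names of the columns it reads.
using dependency_map = std::map<std::string, std::set<std::string>>;

// Groups expression columns into stages such that every column only reads
// expression columns from earlier stages. Returns an empty result on a cycle.
std::vector<std::vector<std::string>>
evaluation_stages(const dependency_map& deps)
{
    std::vector<std::vector<std::string>> stages;
    std::set<std::string> done;

    while (done.size() < deps.size())
    {
        std::vector<std::string> stage;
        for (const auto& [column, reads] : deps)
        {
            if (done.count(column) != 0) continue;
            // Ready when each dependency is a plain table column or already done.
            const bool ready = std::all_of(
                reads.begin(), reads.end(),
                [&](const std::string& dep)
                { return deps.count(dep) == 0 || done.count(dep) != 0; });
            if (ready) stage.push_back(column);
        }

        if (stage.empty()) return {};  // no progress: remaining columns form a cycle

        // Mark the stage as done only after the scan, so columns in the same
        // stage never depend on each other.
        done.insert(stage.begin(), stage.end());
        stages.push_back(std::move(stage));
    }

    assert(done.size() == deps.size());
    return stages;
}
```

The number of stages is the depth of the dependency chain, which is one candidate figure to track when benchmarking: a table with many independent expressions stays at one stage, and only chained expressions add synchronisation points.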

texodus, Mar 17 '25 16:03

@texodus thanks for the clarifications. I will definitely need to think about this a little more, as the vlookup function throws a wrench in my proposal to simply determine dependents and evaluation order accordingly.

As an example the following is part of a larger expression:

vlookup("Sales" > 1e6 ? 'taxv1' : 'taxv2', index)

Firstly, the collector currently doesn't collect literals. It only collects variables (scalars, strings, vectors), so taxv1 and taxv2, even though they are column names (dynamic or otherwise), would not be collected. I can update the library to collect literals (scalars and strings), and we could add some logic post parsing to take the literals that are equivalent to column names into account when determining dependencies. However, there would be many scenarios where this will fail, such as when the string literals are not being used in a vlookup context.

The next issue is similar to the previous:

vlookup('tax' + ("Sales" > 1e6 ? 'v1' : 'v2'), index)

Even if literals were being collected, statements such as the one above, where the underlying string is constructed at runtime based on some changing condition, would be difficult to resolve correctly. This is similar to the issue of searching a codebase for a class, type or variable whose name is constructed during the preprocessor step using macros which themselves are switched on previously defined ifdefs.
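
The limitation can be shown with a stand-in for the "match literals against column names" idea. The sketch below does not use ExprTK at all: `literal_column_refs` is a hypothetical helper that scans the raw expression text for single-quoted literals (ignoring escape sequences) and keeps those that equal a known column name.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical heuristic: every single-quoted string literal that equals a
// known column name is treated as a dependency. Escapes are not handled.
std::set<std::string> literal_column_refs(const std::string& expr,
                                          const std::set<std::string>& columns)
{
    std::set<std::string> refs;
    std::size_t pos = 0;

    while ((pos = expr.find('\'', pos)) != std::string::npos)
    {
        const std::size_t end = expr.find('\'', pos + 1);
        if (end == std::string::npos) break;  // unterminated literal

        const std::string literal = expr.substr(pos + 1, end - pos - 1);
        if (columns.count(literal) != 0) refs.insert(literal);

        pos = end + 1;
    }

    return refs;
}
```

Given the columns taxv1 and taxv2, the first vlookup expression yields both names, whereas the second yields nothing, because 'tax', 'v1' and 'v2' are not column names on their own. The opposite failure also applies: a literal that happens to equal a column name is reported as a dependency even when it is only being compared against a value.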

With regard to the evaluation of expression columns being done in parallel, are we not already seeing correctness issues, or are things resolved because eventually all the columns become consistent?

Moreover, given one can assign values to the current row index of the column being processed, wouldn't that create inconsistencies?

For example, in the market example, one could add the following nonsense expression to have the buy side of the book be eliminated.

"price" := ("side"=='buy') ? 0 : "price"

Depending on the other expression columns, processing this expression column before or after the expression columns that reference the buy prices would result in different values after every update. And looking at the Rust code a little, if two different windows of the same table are opened, it's possible they may be evaluated in different orders, leading to differing results between the windows over the same view.
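
A toy model of the ordering problem, outside of Perspective and ExprTK: one function mimics the assignment expression above, and a second, made-up expression column ("notional", price times size) reads the price. The `row` struct and all function names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical single row of the market example, plus one derived column.
struct row
{
    std::string side;
    double price;
    double size;
    double notional;
};

// Models: "price" := ("side"=='buy') ? 0 : "price"
void zero_buys(row& r)
{
    if (r.side == "buy") r.price = 0.0;
}

// Made-up second expression column that reads "price".
void compute_notional(row& r)
{
    r.notional = r.price * r.size;
}

// Applies the expressions to a copy of the row in the given order.
row evaluate(row r, const std::vector<void (*)(row&)>& order)
{
    for (auto expression : order) expression(r);
    return r;
}
```

For a buy row with price 100 and size 5, evaluating `zero_buys` first gives a notional of 0, while evaluating `compute_notional` first gives 500. The inputs are identical and only the evaluation order differs.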

@texodus would you be able to provide links to the previous discussions around this topic, as I'd like to get an idea of what was discussed and which options were attempted etc, either way lots to think about. 😄

ArashPartow, Mar 20 '25 20:03