[Feature Request] Performance improvement of the `style_data_conditional` when using `column_id` and `row_index` selectors

Open TomaszRewak opened this issue 2 years ago • 2 comments

Hey,

The current implementation of the style_data_conditional property in the DataTable component is quite flexible and allows for creating rich visualizations. Unfortunately, it is also not very performant when the number of rules is high, because the DataTable engine evaluates every single rule for every single cell. This becomes especially troublesome when we want to emulate heatmap formatting in multi-column tables.

Let's first consider a simple example of following the recommended path and assigning each cell value to a predetermined color bucket (instead of using a continuous color scale). If we have 6 colors and 20 columns we want to style, we end up checking 120 conditional rules for every cell in the table. Going further, if we want a continuous color scale (and not discrete color buckets), the way to achieve it is to manually assign a color to each column_id:row_index pair. For a table with N rows and M columns this results in N*M rules and therefore (N*M)^2 time complexity of rendering the table.
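To make the rule count concrete, here is a minimal Python sketch of the continuous-scale case. The helper name `heatmap_rules` and the simple blue-to-red color ramp are made up for illustration; only the `if` / `column_id` / `row_index` / `backgroundColor` keys come from the `style_data_conditional` format itself.

```python
def heatmap_rules(values, column_ids):
    """Build one style rule per cell (hypothetical helper).

    values[r][c] is the numeric value in row r of column column_ids[c].
    """
    flat = [v for row in values for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal

    rules = []
    for r, row in enumerate(values):
        for c, col in enumerate(column_ids):
            t = (row[c] - lo) / span  # position on the scale, 0.0 to 1.0
            red = round(255 * t)
            rules.append({
                "if": {"column_id": col, "row_index": r},
                "backgroundColor": f"rgb({red}, 0, {255 - red})",
            })
    return rules


rules = heatmap_rules([[0, 5], [10, 5]], ["a", "b"])
print(len(rules))  # one rule per cell
```

For a table with 50 rows and 20 columns this produces 1,000 rules, so an engine that checks every rule against every cell performs 1,000,000 checks per render.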

My proposal would be to optimize the rule evaluation process without altering the current DataTable behavior. It can be done by storing style conditions that define column_id or row_index (or both) in lookup tables instead of keeping them in a single IConvertedStyle array. This would mean that the DataTable engine would have to evaluate only the conditions relevant to the current scope - for example only the conditions defined for the current column - instead of going through the entire list. This would especially shine for use cases like continuous heatmap coloring, where the initial (N*M)^2 time complexity would be reduced to a simple N*M.
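The lookup-table idea can be sketched in Python as follows. This is only an illustration of the technique under simplifying assumptions, not the DataTable's actual TypeScript code: `build_index` and `candidate_rules` are hypothetical names, only `column_id` is indexed, and rules without a `column_id` are kept in a separate list so they still apply to every column. Each rule remembers its original position so that the original ordering is preserved when the two lists are merged.

```python
from collections import defaultdict


def build_index(rules):
    """Group rules by column_id, remembering each rule's original position."""
    by_column = defaultdict(list)
    unscoped = []  # rules that do not name a column apply everywhere
    for position, rule in enumerate(rules):
        column_id = rule.get("if", {}).get("column_id")
        if column_id is None:
            unscoped.append((position, rule))
            continue
        # column_id may be a single id or a list of ids
        ids = column_id if isinstance(column_id, list) else [column_id]
        for cid in dict.fromkeys(ids):  # drop duplicate ids, keep order
            by_column[cid].append((position, rule))
    return by_column, unscoped


def candidate_rules(index, column_id):
    """Return only the rules that can match this column, in original order."""
    by_column, unscoped = index
    merged = sorted(by_column.get(column_id, []) + unscoped,
                    key=lambda pair: pair[0])
    return [rule for _, rule in merged]


rules = [
    {"if": {"column_id": "a"}, "backgroundColor": "red"},
    {"backgroundColor": "white"},
    {"if": {"column_id": "b"}, "backgroundColor": "blue"},
    {"if": {"column_id": ["a", "b"]}, "fontWeight": "bold"},
]
index = build_index(rules)
print(candidate_rules(index, "a"))  # rules 0, 1 and 3, in that order
```

The same approach extends to `row_index`. In the per-cell heatmap case each lookup then returns a single rule instead of the full list.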

Of course this change, as simple as it may sound, will require some work: the current implementation is relatively generic and highly abstracted, so some refactoring will be needed. One will also have to make sure that all edge cases are covered (for example, that the order in which conditions with and without column_id are evaluated is preserved).

As I have a relatively big use case for this change, I would be more than happy to implement it myself. Before I start working on it, I just wanted to check whether contributions like this are welcome and whether anyone on the Plotly team has already started working on a similar improvement (I did not see any mention of it in the community forum).

Btw, I really enjoy this framework!

TomaszRewak avatar May 19 '22 16:05 TomaszRewak

Thanks @TomaszRewak - sounds like a smart change to make, and we'd more than welcome a contribution to improve this! I'm not aware of anyone working on this or anything near it.

The only thing I'd ask - in addition to the edge cases you already mentioned - is to include a test or two where a style_data_conditional prop similar to your use case is applied and then modified by a callback (if we don't already have such a test).

alexcjohnson avatar May 19 '22 20:05 alexcjohnson

Thanks!

After looking through the code and considering different scenarios, I think the trickiest part of this change will be properly optimizing the filter_query expressions, which is essential for this improvement.

If one has a style with the following query: `{Region} = "Toronto"`, it would be good to evaluate it only when the value of that field is indeed equal to "Toronto". But of course that's only a simple example, and we need to be sure not to break conditions like `{Region} = "Toronto" || {Temperature} > 0`, which need to be evaluated for all of the rows.

My plan is to statically evaluate the filter_query syntax tree under the assumption that every `field_expression = value_expression` and `value_expression = field_expression` node evaluates to false. If the entire expression statically evaluates to false under that assumption, I will evaluate the style only for rows where at least one of the substituted conditions is true (by storing those styles in a (field:value)=>style lookup table).

For example:

{Region} = "Toronto" -(substitution)-> False -(evaluation)-> False => Evaluate only when {Region} = "Toronto"

{Region} = "Toronto" || {Humidity} = 30 -(substitution)-> False || False -(evaluation)-> False => Evaluate only when {Region} = "Toronto" or {Humidity} = 30

{Region} = "Toronto" && {Temperature} > 0 -(substitution)-> False && Unknown -(evaluation)-> False => Evaluate only when {Region} = "Toronto"

{Region} = "Toronto" || {Temperature} > 0 -(substitution)-> False || Unknown -(evaluation)-> Unknown => Always evaluate
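The four cases above can be reproduced with a small three-valued evaluation. The Python sketch below is an illustration only: the tuple-based tree (`("eq", field, value)`, `("gt", field, value)`, `("and", left, right)`, `("or", left, right)`) is a made-up stand-in for the real filter_query syntax tree, and `static_eval` / `lookup_keys` are hypothetical names.

```python
UNKNOWN = None  # third truth value: cannot be decided statically


def static_eval(node):
    """Evaluate the tree with every equality leaf replaced by False."""
    kind = node[0]
    if kind == "eq":
        return False
    if kind == "and":
        left, right = static_eval(node[1]), static_eval(node[2])
        # a conjunction is False as soon as one side is False
        return False if (left is False or right is False) else UNKNOWN
    if kind == "or":
        left, right = static_eval(node[1]), static_eval(node[2])
        # a disjunction is False only when both sides are False
        return False if (left is False and right is False) else UNKNOWN
    return UNKNOWN  # any other leaf, e.g. ("gt", field, value)


def equality_keys(node):
    """Collect the (field, value) pairs of all equality leaves."""
    kind = node[0]
    if kind == "eq":
        return {(node[1], node[2])}
    if kind in ("and", "or"):
        return equality_keys(node[1]) | equality_keys(node[2])
    return set()


def lookup_keys(node):
    """Return the lookup keys for an indexable query, or None for 'always evaluate'."""
    if static_eval(node) is False:
        return equality_keys(node)
    return None


region = ("eq", "Region", "Toronto")
humidity = ("eq", "Humidity", 30)
warm = ("gt", "Temperature", 0)

print(lookup_keys(region))                     # only rows where Region is Toronto
print(lookup_keys(("or", region, humidity)))   # rows matching either equality
print(lookup_keys(("and", region, warm)))      # only rows where Region is Toronto
print(lookup_keys(("or", region, warm)))       # None: always evaluate
```

Collecting every equality leaf is conservative: a row found through the lookup table still has the full query evaluated against it, so a key that is not strictly necessary costs an extra evaluation but never changes the result.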

Hope that sounds reasonable.

TomaszRewak avatar May 22 '22 21:05 TomaszRewak