Deterministic IDs for ExecutionPlan
Is your feature request related to a problem or challenge?
Currently execution plans do not have an id associated with them this makes comparison of metrics across the runs. Additionally we would like to add a UI to our project to display the metrics as well snapshot our plans.
Looking at the code, it seems to me that the the physical_planner does record the node index which is not passed down to map_logical_node_to_physical.
Describe the solution you'd like
It would quite trivial to pass the node index to ExecutionPlan creation step and add additional field operator_id plan_id to ExecutionPlanProperties.
Describe alternatives you've considered
We looked into using task_id field, but it always seems to be set to None.
Additional context
No response
During planning there are couple of physical optimization rules which modify the ExecutionPlan (rules can be found in the link).
Hence, I think proper place to generate id for an ExecutionPlan is the end of all optimizations where ExecutionPlan is finalized (doing this during the conversion from logical plan to physical plan may be premature).
I guess, once ExecutionPlan is finalized, we can generate an id for the plan using TreeNode API.
@ozankabak @alamb taking a look at this again as we are working on snapshotting and UI for Denoramlized. It seems to me that the ideal place to add these is in the PlanProperties with something like
pub struct PlanProperties { pub eq_properties: EquivalenceProperties, pub partitioning: Partitioning, pub execution_mode: ExecutionMode, output_ordering: Option<LexOrdering>, node_id: Option<usize>, }
with a method on ExecutionPlan to set the node_id (as this isnt available when the compute_properties() is called on the node at creation time.
fn with_node_id(&mut self, _node_id: usize) {}
Once the final physical plan is generated in create_physical_plan() after all the optimizer passes, we can then visit all the nodes and call "with_node_id" on them. Does this sound like a reasonable approach?
Once the final physical plan is generated in create_physical_plan() after all the optimizer passes, we can then visit all the nodes and call "with_node_id" on them. Does this sound like a reasonable approach?
I do think this sounds reasonable
Some questions and thoughts:
- What would be the default value of id (if
with_node_idwasn't called)? - You might want to consider an API like
pub fn with_node_id(self: Arc<...>, node_id: usizeif you can as the ExcutionPlan nodes are almost alwaysArcd) - I would recommend not showing this id in the explain plan if they change / are not deterministic
- I would recommend extending one of the examples (perhaps related to planning, or make one related to streaming) showing how you use this field both as a test as well as to ensure the API works well
Overall sounds reasonable, I will circle back tomorrow after discussing with Synnada folks. Maybe we can upstream some code to help.
- What would be the default value of id (if
with_node_idwasn't called)?
so was thinking if this isn't set, simply letting it be None in the default. this will be mostly to account for UserDefined ExecutionPlans that haven't overridden that method.
Just cut a PR based on feedback from @alamb
We have been thinking about this, I will reply in the PR
Just noticed the related PR was closed, maybe we can continue to discuss the final/general way here.
would love to actually get this merged in @xudong963
We'd love to resume iterations on https://github.com/apache/datafusion/pull/12186#issuecomment-2554359314