datafusion Deterministic IDs for ExecutionPlan

Is your feature request related to a problem or challenge?

Currently execution plans do not have an id associated with them this makes comparison of metrics across the runs. Additionally we would like to add a UI to our project to display the metrics as well snapshot our plans.

Looking at the code, it seems to me that the the physical_planner does record the node index which is not passed down to map_logical_node_to_physical.

Describe the solution you'd like

It would quite trivial to pass the node index to ExecutionPlan creation step and add additional field operator_id plan_id to ExecutionPlanProperties.

Describe alternatives you've considered

We looked into using task_id field, but it always seems to be set to None.

Additional context

No response

Jul 09 '24 18:07 ameyc

During planning there are couple of physical optimization rules which modify the ExecutionPlan (rules can be found in the link). Hence, I think proper place to generate id for an ExecutionPlan is the end of all optimizations where ExecutionPlan is finalized (doing this during the conversion from logical plan to physical plan may be premature).

I guess, once ExecutionPlan is finalized, we can generate an id for the plan using TreeNode API.

Jul 10 '24 10:07 mustafasrepo

@ozankabak @alamb taking a look at this again as we are working on snapshotting and UI for Denoramlized. It seems to me that the ideal place to add these is in the PlanProperties with something like

pub struct PlanProperties { pub eq_properties: EquivalenceProperties, pub partitioning: Partitioning, pub execution_mode: ExecutionMode, output_ordering: Option<LexOrdering>, node_id: Option<usize>, }

with a method on ExecutionPlan to set the node_id (as this isnt available when the compute_properties() is called on the node at creation time. fn with_node_id(&mut self, _node_id: usize) {}

Once the final physical plan is generated in create_physical_plan() after all the optimizer passes, we can then visit all the nodes and call "with_node_id" on them. Does this sound like a reasonable approach?

Aug 24 '24 00:08 ameyc

Once the final physical plan is generated in create_physical_plan() after all the optimizer passes, we can then visit all the nodes and call "with_node_id" on them. Does this sound like a reasonable approach?

I do think this sounds reasonable

Some questions and thoughts:

What would be the default value of id (if with_node_id wasn't called)?
You might want to consider an API like pub fn with_node_id(self: Arc<...>, node_id: usize if you can as the ExcutionPlan nodes are almost always Arcd)
I would recommend not showing this id in the explain plan if they change / are not deterministic
I would recommend extending one of the examples (perhaps related to planning, or make one related to streaming) showing how you use this field both as a test as well as to ensure the API works well

Aug 25 '24 11:08 alamb

Overall sounds reasonable, I will circle back tomorrow after discussing with Synnada folks. Maybe we can upstream some code to help.

Aug 25 '24 13:08 ozankabak

What would be the default value of id (if with_node_id wasn't called)?

so was thinking if this isn't set, simply letting it be None in the default. this will be mostly to account for UserDefined ExecutionPlans that haven't overridden that method.

Aug 26 '24 20:08 ameyc

Just cut a PR based on feedback from @alamb

Aug 27 '24 05:08 ameyc

We have been thinking about this, I will reply in the PR

Aug 27 '24 07:08 ozankabak

Just noticed the related PR was closed, maybe we can continue to discuss the final/general way here.

Mar 03 '25 08:03 xudong963

would love to actually get this merged in @xudong963

Mar 03 '25 22:03 ameyc

We'd love to resume iterations on https://github.com/apache/datafusion/pull/12186#issuecomment-2554359314

Mar 04 '25 07:03 berkaysynnada