datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Only recompute schema in `TypeCoercion` when necessary

Open alamb opened this issue 1 year ago • 2 comments

Draft as it builds on https://github.com/apache/datafusion/pull/10356

Which issue does this PR close?

closes https://github.com/apache/datafusion/issues/10365

Rationale for this change

recomputing the Schema can be expensive, so only do it when necessary

What changes are included in this PR?

  1. Update TypeCoercionRewriter to track when the expression is actually rewritten (via Transformed::yes)
  2. Only recompute the schema when an expression was rewritten

This is a big change change as it needs to is properly track when expressions were changed which required updating the entire expression rewrite

Are these changes tested?

Existing CI

Are there any user-facing changes?

No functional changes

TODO performance tests

alamb avatar May 03 '24 15:05 alamb

There are a few tests that don't pass for reason's I haven't debugged yet but it looks like recomputing schemas may be masking issues elsewhere. Once https://github.com/apache/datafusion/pull/10356 is merged I'll do another profiling run to figure out if recomputing schema shows up in the traces and will keep hacking on this if so

alamb avatar May 03 '24 16:05 alamb

Update here is I don't think I'll have a chance to work on this in the near term unless recomputing the schema shows up in planning benchmarks again.

I think given the level of effort involved, this approach is not likely to yield a large speedup (though I do think this PR's approach would be an improvement over the current code)

alamb avatar May 17 '24 08:05 alamb

I realistically don't plan to spend any more time on this :(

alamb avatar Jun 15 '24 16:06 alamb