delta Fix double file scan from nested schema pruning

Description

When comparing DeltaScan filters, un-resolves and re-resolves nested field extractions from the source expression set to the target expression set. This is to get around the fact the the prepared filters are pre-optimization, and the actual filters are post optimization that can involve nested schema pruning.

Resolves #1073

How was this patch tested?

New UT.

Does this PR introduce any user-facing changes?

Yes, removes a double file scan for a scan with filters on nested columns.

Apr 22 '22 22:04 Kimahriman

@Kimahriman thanks for this PR, will need to take a thorough look.

Regarding, "Not sure how to test the actual behavior, so I just added a test for the helper method that demonstrates the issue.". Since you've stated this only happens when we apply a scan which filters on nested columns, can you write a unit test that previously had such a double scan, and then use getScanReport to assert it only scans once?

Apr 25 '22 20:04 scottsand-db

Yeah this was just kinda brainstorming an approach. At first I tried to just fully prune any expressions to only the field it cared about, but then just hit weird cases of a isnotnull(nested) and what to do about that because it's not extracting any fields. Theoretically I think this should be safe since all column names should be unique, but definitely let me know what you think or if you can come up with a better idea.

Regarding, "Not sure how to test the actual behavior, so I just added a test for the helper method that demonstrates the issue.". Since you've stated this only happens when we apply a scan which filters on nested columns, can you write a unit test that previously had such a double scan, and then use getScanReport to assert it only scans once?

Didn't know that was a thing so I'll look into that.

Apr 25 '22 20:04 Kimahriman

I guess another strategy could be gather all attributes from the actual filters and replace them in the prepared filters by expr ID?

Apr 25 '22 22:04 Kimahriman

I guess another strategy could be gather all attributes from the actual filters and replace them in the prepared filters by expr ID?

This seems like a more robust solution to me. What do you think @zsxwing ?

Apr 25 '22 23:04 scottsand-db

Also, I don't think getScanReport can catch this (after trying and seeing it not work). There's still only one FileSourceScanExec, it just does double duty in the matchingFiles method

Apr 25 '22 23:04 Kimahriman

Ah shoot not as easy as I thought, the ordinals for the struct fields will be off after that. Need to think through that alternative a little more

Apr 25 '22 23:04 Kimahriman

I think you would essentially have to "reresolve" the prepared filters so to speak, so the options I can think of right now are basically:

This approach, convert prepared and actual to unresolved
Convert the prepared to unresolved, and then try to re-resolve them with the actual attributes

Apr 26 '22 11:04 Kimahriman

I am trying catch up to the discussion and think about it a little. my biggest fear is that if we attempt to duplicate any resolution / unresolution code path, then that duplicate code path can easily become inconsistent with the actual nested schema resolution logic. hence bugs leading to incorrect set of files being read. That why till now we have erred toward reading more data and let it be filtered by the filter plan than reading less data.

May 10 '22 18:05 tdas

The "unresolve both sides" seems less likely to have any weird resolution issues, but obviously just takes one example to prove it doesn't work. Haven't thought of one yet though

May 10 '22 18:05 Kimahriman

So we spent some time looking into the issue a little bit more. A few solutions we're considering include

Comparing AttributeReferences using their exprId
Applying the rule that does the nested schema pruning to the prepared filters before checking equality
Inject PrepareDeltaScan somewhere else (our options might be very limited here)

As a next step we want to identify the rule that does the nested schema pruning (which wasn't immediately obvious from a precursory look.) To figure this out, I'm planning to set spark.sql.planChangeLog.level to warn and try to find the rule. I'm not sure when I'll have time to do this, so feel free to look into it yourself and investigate the plausibility of (2).

May 14 '22 02:05 allisonport-db

I believe it's the NestedColumnAliasing extractor in the ColumnPruning rule. So elaborating on the first two, these seem like the possibilities:

Expand on the current method a bit. Take all the AttributeReference's is the actual filters, replace them by exprId in the prepared filters, and then transform all GetStructField's to UnresolvedExtractValue on both side. This still removes data type comparisons of the nested field, but you do have the AttributeReference's to compare against at least. Alternatively, you could try to re-resolve the prepared filters with the replaced AttributeReference's, but not sure how stable that would be. Maybe as simple as transforming GetStructField to an ExtractValue post-attribute replacement?
Try to run the ColumnPruning optimization rule on both prepared and actual filters, though I'm not sure how that would work since that works on plans and not expressions.

May 15 '22 12:05 Kimahriman

Tried the first method, let me know what you think

May 15 '22 13:05 Kimahriman

Hey @Kimahriman just wanted to provide an update. I didn't forget about this, we are just very busy with the upcoming release. Will be taking another look as soon as I have the chance.

Jun 09 '22 22:06 allisonport-db

No problem! I'd rather have a quicker release for Spark 3.3 😊

Jun 09 '22 22:06 Kimahriman

Figured out a test using withLogicalPlansCaptured to ensure there's only one log scan

Jun 18 '22 13:06 Kimahriman

Hi @Kimahriman - awesome! We will finish reviewing + merging this PR after the next Delta release, which we are working hard at now. Will get back to you within 1 or 2 weeks, cheers!

Jun 21 '22 23:06 scottsand-db

The test looks good, thanks for adding that. As for the current solution I'm not completely convinced on its robustness/safeness, and I'd rather err on the side of rescanning unnecessarily. It would be much better if we could find a way to reuse the codepaths that are in the rule doing the pruning. I did take a look though and I see how that's not super accessible.

Jul 28 '22 01:07 allisonport-db

As for the current solution I'm not completely convinced on its robustness/safeness, and I'd rather err on the side of rescanning unnecessarily.

Has any scenario been thought of yet where the post-optimization filters actually change? The only thing that should happen between creating the initial scan and the final scan is optimization rules are applied. Optimization rules shouldn't change the result of the query, just the performance of the query. In this case, theoretically the only thing that should change with the filters is that more filters are applied post-optimization. If files are included post-optimization that weren't included pre-optimization, that would indicate some serious bug unrelated to this PR imo.

So in the case of a false positive, where we think the filters are the same with this logic but they are not actually, the worst thing that could happen is we use the pre-optimization scan that could include more files than necessary. And in the false negative case, we will simply recalculate the files for the scan regardless, so it's just the same performance hit of the extra log scan.

Is there anything I'm missing or not thinking of?

Jul 28 '22 12:07 Kimahriman