spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-42199][SQL] Fix issues around Dataset.groupByKey

Open EnricoMi opened this issue 2 years ago • 5 comments

What changes were proposed in this pull request?

Introduces ScopedExpression, which allows to resolve an expression against a set of attributes. This is used by Dataset.groupByKey.agg, Dataset.groupByKey.flatMapSortedGroups, and Dataset.groupByKey.cogroupSorted to fix surprising error messages and to make Dataset.groupByKey.mapValues.agg work at all.

Why are the changes needed?

Calling ds.groupByKey(func: V => K) creates columns to store the key value. These columns may conflict with columns that already exist in ds. Function Dataset.groupByKey.agg accounts for this with a very specific rule, which has some surprising weaknesses:

spark.range(1)
  // groupByKey adds column 'value'
  .groupByKey(id => id)
  // which cannot be referenced, though it is suggested
  .agg(count("value"))
org.apache.spark.sql.AnalysisException: Column 'value' does not exist.
Did you mean one of the following? [value, id];

An existing 'value' column can be referenced:

// dataset with column 'value'
spark.range(1).select($"id".as("value")).as[Long]
  // groupByKey adds another column 'value'
  .groupByKey(id => id)
  // agg accounts for the extra column and excludes it when resolving 'value'
  .agg(count("value"))
  .show()
+---+------------+
|key|count(value)|
+---+------------+
|  0|           1|
+---+------------+

While column suggestion shows both 'value' columns:

spark.range(1).select($"id".as("value")).as[Long]
  .groupByKey(id => id)
  .agg(count("unknown"))
org.apache.spark.sql.AnalysisException: Column 'unknown' does not exist.
Did you mean one of the following? [value, value]

However, mapValues introduces another 'value' column, which should be referencable, but it breaks the exclusion introduced by agg:

spark.range(1)
  // groupByKey adds column 'value'
  .groupByKey(id => id)
  // adds another 'value' column
  .mapValues(value => value)
  // which cannot be referenced in agg
  .agg(count("value"))
org.apache.spark.sql.AnalysisException: Reference 'value' is ambiguous, could be: value, value.

Does this PR introduce any user-facing change?

Fixes suggestions in above error messages from

... Did you mean one of the following? [value, value]
... Did you mean one of the following? [value, id];

to

... Did you mean one of the following? [value]
... Did you mean one of the following? [id];

Allows for Dataset.groupByKey.mapValues.agg.

How was this patch tested?

Tests in DatasetSuite.

EnricoMi avatar Jan 26 '23 15:01 EnricoMi

@cloud-fan what do you think?

EnricoMi avatar Feb 13 '23 12:02 EnricoMi

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Jun 30 '23 00:06 github-actions[bot]

@cloud-fan @dongjoon-hyun are you interested in fixing these inconsistencies?

EnricoMi avatar Jan 10 '24 19:01 EnricoMi

I have moved the logic that adds the ScopedExpression into a rule because some code paths (especially connect) call into the case class, which cannot rewrite the given sort argument(s). Alternative would be to duplicate that logic.

All tests green.

EnricoMi avatar Jan 19 '24 06:01 EnricoMi

@cloud-fan all tests green and I think all comments are addressed

EnricoMi avatar Jan 29 '24 13:01 EnricoMi

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Jul 30 '24 00:07 github-actions[bot]

@EnricoMi sorry this PR is lost track. Have you addressed all the review comments?

cloud-fan avatar Jul 31 '24 00:07 cloud-fan

Have you addressed all the review comments?

@cloud-fan all comments addressed, see above.

EnricoMi avatar Aug 01 '24 18:08 EnricoMi