
[SPARK-52104][CONNECT][SCALA] Validate column name eagerly in Spark Connect Scala Client

xi-db opened this pull request 7 months ago • 4 comments

What changes were proposed in this pull request?

Currently, calling DataFrame.col(colName) with a non-existent column in the Scala Spark Connect client does not raise an error. In contrast, both PySpark (in either Spark Connect or Spark Classic) and Scala in Spark Classic do raise an exception in such cases. This leads to inconsistent behavior between Spark Connect and Spark Classic in Scala.

PySpark on Spark Classic:

> spark.range(10)['nonexistent']
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

PySpark on Spark Connect:

> spark.range(10)['nonexistent']
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Scala on Spark Classic:

> spark.range(10).col("nonexistent")
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Scala on Spark Connect:

> spark.range(10).col("nonexistent")
res5: org.apache.spark.sql.Column = unresolved_attribute {
  unparsed_identifier: "nonexistent"
  plan_id: 2
}

It doesn't throw any exception.

This PR implements eager validation of column names in the DataFrame.col(colName) method of the Scala client, ensuring behavior consistent with both Spark Classic and PySpark. The implementation is based on PySpark's __getitem__ and verify_col_name methods.

With this change, the Scala client on Spark Connect now throws an error:

> spark.range(10).col("nonexistent")
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703
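For illustration, here is a minimal sketch of what this eager check amounts to, using only public APIs. The helper name colOrFail is hypothetical; the actual PR modifies the client's DataFrame.col internally and, following PySpark's verify_col_name, also handles nested and quoted names, which this sketch does not:

  import org.apache.spark.sql.{Column, DataFrame}

  // Hypothetical helper, not the PR's code: validate a top-level column
  // name against the schema before building the Column. In Spark Connect,
  // df.schema triggers an Analyze RPC to the server (the extra round trip
  // discussed later in this thread).
  def colOrFail(df: DataFrame, name: String): Column = {
    val fields = df.schema.fieldNames
    if (!fields.contains(name)) {
      throw new IllegalArgumentException(
        s"Column `$name` cannot be resolved. Did you mean one of: " +
          fields.map(f => s"`$f`").mkString(", ") + "?")
    }
    df.col(name)
  }

  // colOrFail(spark.range(10).toDF(), "nonexistent")
  // => IllegalArgumentException: Column `nonexistent` cannot be resolved. Did you mean one of: `id`?

The real change raises the same AnalysisException with the UNRESOLVED_COLUMN error class shown above; IllegalArgumentException is used here only to keep the sketch self-contained.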

Why are the changes needed?

This PR ensures consistent behavior between Spark Connect and Spark Classic in the scenario described above.

Does this PR introduce any user-facing change?

Yes, referencing a non-existent column in the Scala client will now throw an error.

How was this patch tested?

New test case.

Was this patch authored or co-authored using generative AI tooling?

No.

xi-db, May 13 '25 15:05

Hi @zhengruifeng, I'm fixing the behavior difference when referencing a non-existent column in the Spark Connect Scala client, based on PySpark's __getitem__ and verify_col_name methods. Could you please review this PR?

xi-db, May 13 '25 15:05

You may need to resolve the test failures; otherwise LGTM.

zhengruifeng, May 14 '25 00:05

@xi-db the Connect API is supposed to be lazy. That we did this in Python is a mistake. Concretely, I can see two problems with this:

  • It can create quite a few extra RPCs.
  • It is misleading. By the time you submit something for execution, the underlying data might have changed, so you will see a failure at that point anyway. This works for Classic because we have eager analysis, and the Dataset is bound at definition time instead of execution time (see the sketch after this list).
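For illustration, a minimal sketch of the lazy behavior being described, assuming a running Spark Connect Scala session bound to spark:

  val df = spark.range(10)
  // No RPC and no error: col just builds an unresolved_attribute expression,
  // as shown in the output earlier in this PR description.
  val c = df.col("nonexistent")
  // The plan only reaches the server when it is analyzed or executed, so the
  // UNRESOLVED_COLUMN failure surfaces there instead:
  // df.select(c).collect()  // throws AnalysisException at execution time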

hvanhovell, Jun 10 '25 13:06

@hvanhovell Thank you, makes sense. In this case, we can close this PR. Do we want to remove the column name validation from __getitem__ on PySpark, so df['non_existing_col'] won't trigger an analysis call? cc @zhengruifeng

xi-db, Jun 10 '25 14:06

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot], Sep 19 '25 00:09