
[SPARK-51919][PYTHON] Allow overwriting statically registered Python Data Source

Open wengh opened this issue 7 months ago • 3 comments

What changes were proposed in this pull request?

  • Allow overwriting static Python Data Sources during registration
  • Update documentation to clarify Python Data Source behavior and registration options

Why are the changes needed?

Static registration is somewhat obscure and does not always work as expected, e.g. when the module providing DefaultSource is installed after lookup_data_sources has already run. In practice, users (or LLM agents) therefore often want to register the data source explicitly even if it is also provided as a DefaultSource. Raising an error in this case interrupts the workflow and makes LLM agents spend extra tokens regenerating the same code without the registration call.

This change also makes the behavior consistent with user data source registrations, which are already allowed to overwrite previous user registrations.

Does this PR introduce any user-facing change?

Yes. Previously, registering a Python Data Source with the same name as a statically registered one would throw an error. With this change, it will overwrite the static registration.
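
To illustrate the behavior change described above, here is a minimal toy model in plain Python. It is not Spark's implementation: the class `ToyDataSourceRegistry`, its `allow_static_overwrite` flag, and the source name `"fake"` are all hypothetical names invented for this sketch. It only assumes that statically discovered sources and explicitly registered sources are tracked separately, and that an explicit registration takes precedence on lookup.

```python
class ToyDataSourceRegistry:
    """Toy stand-in for a data source registry (hypothetical, not Spark code)."""

    def __init__(self, static_sources, allow_static_overwrite):
        # Sources found by static discovery (e.g. a module exposing DefaultSource).
        self._static = dict(static_sources)
        # Sources registered explicitly by the user.
        self._user = {}
        # False models the old behavior, True models the behavior after this PR.
        self._allow_static_overwrite = allow_static_overwrite

    def register(self, name, source):
        if name in self._static and not self._allow_static_overwrite:
            # Old behavior: a name clash with a static source is an error.
            raise ValueError(f"data source {name!r} is already statically registered")
        # User registrations may always replace earlier user registrations.
        self._user[name] = source

    def lookup(self, name):
        # An explicit registration shadows a static one with the same name.
        if name in self._user:
            return self._user[name]
        return self._static[name]


# Old behavior: explicit registration over a static source fails.
old = ToyDataSourceRegistry({"fake": "static impl"}, allow_static_overwrite=False)
try:
    old.register("fake", "user impl")
    old_raised = False
except ValueError:
    old_raised = True

# New behavior: the explicit registration wins.
new = ToyDataSourceRegistry({"fake": "static impl"}, allow_static_overwrite=True)
new.register("fake", "user impl")
resolved = new.lookup("fake")
```

In this sketch `old_raised` ends up `True` and `resolved` is the user-registered value, which mirrors the before and after described in this section.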

How was this patch tested?

Added a test in PythonDataSourceSuite.scala to verify that static sources can be overwritten correctly.

Was this patch authored or co-authored using generative AI tooling?

No

wengh • Apr 25 '25 15:04

@allisonwang-db @HyukjinKwon please take a look

wengh • Apr 25 '25 16:04

cc @allisonwang-db

HyukjinKwon • Apr 28 '25 08:04

LGTM

the-sakthi • May 28 '25 18:05

LGTM!

the-sakthi • Jul 08 '25 17:07

thanks, merging to master

allisonwang-db • Jul 08 '25 18:07