[SPARK-46894][PYTHON] Move PySpark error conditions into standalone JSON file
What changes were proposed in this pull request?
Move PySpark error conditions into a standalone JSON file.
Adjust the instructions in MANIFEST.in so they package the new JSON file correctly.
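For illustration, the packaging rule in MANIFEST.in could look like the sketch below; the exact directive in the PR may differ.

```
# Sketch only; the PR's actual MANIFEST.in directive may differ.
# Ship the standalone error-conditions file with source distributions.
recursive-include pyspark/errors *.json
```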
Why are the changes needed?
Having the JSON in its own file enables better IDE support for editing and managing the JSON. For example, VS Code will detect duplicate keys automatically and underline them.
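For instance, in a hypothetical snippet like the one below, an IDE flags the duplicated key immediately, whereas in a Python dict literal the second entry would silently win:

```json
{
  "EXAMPLE_CONDITION": { "message": ["First definition."] },
  "EXAMPLE_CONDITION": { "message": ["Accidental duplicate; VS Code underlines this key."] }
}
```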
This change also simplifies the logic to regenerate the JSON.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
- Existing unit tests.

- I confirmed that rewriting the JSON file works as intended:

  ```python
  from pyspark.errors.exceptions import _write_self

  _write_self()
  ```

- I built a PySpark distribution, installed it, and confirmed the error class test still works:

  ```sh
  trash dist/
  ./dev/make-distribution.sh --pip -DskipTests -Phive

  # The JSON file is put in the dist/ directory.
  find dist -name 'error-conditions.json'
  # dist/python/pyspark/errors/error-conditions.json

  # It's also listed in SOURCES.txt.
  grep error-conditions.json dist/python/pyspark.egg-info/SOURCES.txt
  # pyspark/errors/error-conditions.json

  cd dist/python
  python -m venv venv
  source venv/bin/activate
  pip install .

  # It also gets installed into the newly created venv.
  find venv -name error-conditions.json
  # venv/lib/python3.11/site-packages/pyspark/errors/error-conditions.json

  # Running the test from inside the venv also succeeds.
  python pyspark/errors/tests/test_errors.py
  # Ran 3 tests in 0.000s
  #
  # OK
  ```
Was this patch authored or co-authored using generative AI tooling?
No.
@itholic, can you take a look please? I remember you looked at this before and couldn't find a good way to package the JSON.
IIRC there was no major issue with managing the JSON itself. However, since we cannot integrate with the error-classes.json file on the JVM side (we didn't want to introduce a JVM dependency), we simply adopted a `py` file, which is more convenient to manage in Python. (That said, we couldn't see any benefit to using `json` instead of `py` at the time.)
So I agree to change to a `json` file if the advantage of using a `json` file over using a `py` file is clear, and if there are no issues with packaging. Also, you might need to take a deeper look at the documentation; for example, we're pointing to the `py` file path from Error classes in PySpark.
> since we cannot integrate with the error-classes.json file on the JVM side
This is for a separate potential PR, but if it were possible to use the "main" error JSON files from Scala in PySpark automatically, would we want to do that? I see that PySpark's errors don't define a SQLSTATE, so I assumed they were a separate thing and we didn't want to reuse the main error definitions.
> So I agree to change to a `json` file if the advantage of using a `json` file over using a `py` file is clear, and if there are no issues with packaging. Also, you might need to take a deeper look at the documentation; for example, we're pointing to the `py` file path from Error classes in PySpark.
Is this a reference to this command? https://github.com/apache/spark/blob/8060e7e73170c0122acb2a005f3c54487e226208/python/docs/source/conf.py#L42
Anyway, just FYI I am in the middle of revamping the main error documentation to make it easier to maintain. First I am cleaning up the terminology in #44902 but then I will be consolidating the various sql-error-* pages. I will tag you in those PRs when I open them.
The build is still running, but the pyspark-core tests are passing. I believe importlib.resources is what we need to load data files packaged in the distribution we upload to PyPI.
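For reference, a minimal sketch of the importlib.resources approach (the function name here is illustrative, not necessarily what this PR uses):

```python
import importlib.resources
import json

def _load_error_conditions() -> dict:
    # Resolve the data file by package name rather than filesystem path,
    # so it works whether PySpark is installed as a wheel, sdist, or ZIP.
    text = (
        importlib.resources.files("pyspark.errors")  # Python 3.9+
        .joinpath("error-conditions.json")
        .read_text(encoding="utf-8")
    )
    return json.loads(text)
```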
> This is for a separate potential PR, but if it were possible to use the "main" error JSON files from Scala in PySpark automatically, would we want to do that?
I don't think so. As I recall, the main reason for not doing it was that, as you said, the error structure on the PySpark side is different from the error structure on the JVM side.
> I will be consolidating the various sql-error-* pages. I will tag you in those PRs when I open them.
+1
> Is this a reference to this command?
Yes, so you might need to fix the description in https://github.com/apache/spark/blob/master/python/pyspark/errors_doc_gen.py#L44.
I think we should wait for the conversation in SPARK-46810 to resolve before merging this in.
But apart from that, is there anything more you'd like me to check here? Do you approve of the use of importlib.resources (which I think is the "correct" solution in our case)?
Converting to draft until SPARK-46810 is resolved.
We have agreed in SPARK-46810 to rename "error class" to "error condition", so this PR is unblocked since we know we won't need to rename the new error-conditions.json file.
The work to rename all instances of "error class" to "error condition" across the board will happen in SPARK-46810 and SPARK-47429. I would like to keep this PR focused on simply moving the Python error conditions into a JSON file.
@HyukjinKwon - I believe this PR is ready to go. Do you have any outstanding concerns?
Just to make sure, does it work when you install PySpark as a ZIP file? e.g., downloading it from https://spark.apache.org/downloads.html would install PySpark as a ZIP file.
I've updated the PR description with this additional ZIP test (test 4).
Just to confirm, the ZIP that gets uploaded to the site is the one under python/dist/. Is that correct?
It's python/lib/pyspark.zip once you finish building.
Hmm, so when people install this ZIP how exactly do they do it? Because it does not install cleanly like the ZIP under python/dist/.
```
$ pip install .../spark/python/lib/pyspark.zip
Processing .../spark/python/lib/pyspark.zip
ERROR: file:///.../spark/python/lib/pyspark.zip does not appear to be a Python project:
neither 'setup.py' nor 'pyproject.toml' found.
```
People can actually use PySpark directly by importing it from pyspark.zip; see https://spark.apache.org/docs/latest/api/python/getting_started/install.html?highlight=pythonpath#manually-downloading
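For context, the manual setup that page describes boils down to putting the bundled ZIPs on PYTHONPATH, roughly like this (paths and the Py4J version are illustrative):

```sh
# Illustrative; adjust SPARK_HOME and the Py4J version to your download.
export SPARK_HOME=/path/to/spark
export PYTHONPATH="$SPARK_HOME/python/lib/pyspark.zip:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
python -c "import pyspark; print(pyspark.__version__)"
```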
OK, I tested that as well and updated the PR description accordingly.
I also tweaked the syntax highlighting for that bit of documentation you linked to, because it was off. This is how it currently looks:

[screenshot of the current rendering]

Note the weird italicization and the missing `*`.
@HyukjinKwon - Anything else you'd like to see done here?
Friendly ping @HyukjinKwon.
Merged to master.