[SPARK-46894][PYTHON] Move PySpark error conditions into standalone JSON file
What changes were proposed in this pull request?
Move PySpark error conditions into a standalone JSON file.
Adjust the instructions in MANIFEST.in so they package the new JSON file correctly.
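For illustration, the packaging rule in MANIFEST.in could look like the sketch below; the exact directive in the PR may differ.

```
# Sketch only; the PR's actual MANIFEST.in directive may differ.
# Ship the standalone error-conditions file with source distributions.
recursive-include pyspark/errors *.json
```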
Why are the changes needed?
Having the JSON in its own file enables better IDE support for editing and managing the JSON. For example, VS Code will detect duplicate keys automatically and underline them.
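For instance, in a hypothetical snippet like the one below, an IDE flags the duplicated key immediately, whereas in a Python dict literal the second entry would silently win:

```json
{
  "EXAMPLE_CONDITION": { "message": ["First definition."] },
  "EXAMPLE_CONDITION": { "message": ["Accidental duplicate; VS Code underlines this key."] }
}
```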
This change also simplifies the logic to regenerate the JSON.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
- Existing unit tests.

- I confirmed that rewriting the JSON file works as intended:

  ```python
  from pyspark.errors.exceptions import _write_self

  _write_self()
  ```

- I built a PySpark distribution, installed it, and confirmed the error class test still works:

  ```sh
  trash dist/
  ./dev/make-distribution.sh --pip -DskipTests -Phive

  # The JSON file is put in the dist/ directory.
  find dist -name 'error-conditions.json'
  # dist/python/pyspark/errors/error-conditions.json

  # It's also listed in SOURCES.txt.
  grep error-conditions.json dist/python/pyspark.egg-info/SOURCES.txt
  # pyspark/errors/error-conditions.json

  cd dist/python
  python -m venv venv
  source venv/bin/activate
  pip install .

  # It also gets installed into the newly created venv.
  find venv -name error-conditions.json
  # venv/lib/python3.11/site-packages/pyspark/errors/error-conditions.json

  # Running the test from inside the venv also succeeds.
  python pyspark/errors/tests/test_errors.py
  # Ran 3 tests in 0.000s
  #
  # OK
  ```
Was this patch authored or co-authored using generative AI tooling?
No.
@itholic, can you take a look please? I remember you looked at this before and couldn't find a good way to package the JSON.
IIRC there was no major issue with managing the JSON itself. However, since we cannot integrate with the error-classes.json file on the JVM side (we didn't want to introduce a JVM dependency), we simply adopted a `py` file, which is more convenient to manage in Python. (That said, we couldn't see any benefit to using `json` instead of `py` at the time.)
So I agree to change to a `json` file if the advantage of using a `json` file over using a `py` file is clear, and if there are no issues with packaging. Also, you might need to take a deeper look at the documentation; for example, we're pointing to the `py` file path from Error classes in PySpark.
> since we cannot integrate with the error-classes.json file on the JVM side
This is for a separate potential PR, but if it were possible to use the "main" error JSON files from Scala in PySpark automatically, would we want to do that? I see that PySpark's errors don't define a SQLSTATE, so I assumed they were a separate thing and we didn't want to reuse the main error definitions.
> So I agree to change to a `json` file if the advantage of using a `json` file over using a `py` file is clear, and if there are no issues with packaging. Also, you might need to take a deeper look at the documentation; for example, we're pointing to the `py` file path from Error classes in PySpark.
Is this a reference to this command? https://github.com/apache/spark/blob/8060e7e73170c0122acb2a005f3c54487e226208/python/docs/source/conf.py#L42
Anyway, just FYI I am in the middle of revamping the main error documentation to make it easier to maintain. First I am cleaning up the terminology in #44902 but then I will be consolidating the various sql-error-* pages. I will tag you in those PRs when I open them.
The build is still running, but the pyspark-core tests are passing. I believe importlib.resources is what we need to load data files packaged in the distribution we upload to PyPI.
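For reference, a minimal sketch of the importlib.resources approach (the function name here is illustrative, not necessarily what this PR uses):

```python
import importlib.resources
import json

def _load_error_conditions() -> dict:
    # Resolve the data file by package name rather than filesystem path,
    # so it works whether PySpark is installed as a wheel, sdist, or ZIP.
    text = (
        importlib.resources.files("pyspark.errors")  # Python 3.9+
        .joinpath("error-conditions.json")
        .read_text(encoding="utf-8")
    )
    return json.loads(text)
```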
> This is for a separate potential PR, but if it were possible to use the "main" error JSON files from Scala in PySpark automatically, would we want to do that?
I don't think so. As I recall, the main reason for not doing it was that, as you said, the error structure on the PySpark side is different from the error structure on the JVM side.
> I will be consolidating the various sql-error-* pages. I will tag you in those PRs when I open them.
+1
> Is this a reference to this command?
Yes, so you might need to fix the description in https://github.com/apache/spark/blob/master/python/pyspark/errors_doc_gen.py#L44.
I think we should wait for the conversation in SPARK-46810 to resolve before merging this in.
But apart from that, is there anything more you'd like me to check here? Do you approve of the use of importlib.resources (which I think is the "correct" solution in our case)?
Converting to draft until SPARK-46810 is resolved.
We have agreed in SPARK-46810 to rename "error class" to "error condition", so this PR is unblocked since we know we won't need to rename the new error-conditions.json file.
The work to rename all instances of "error class" to "error condition" across the board will happen in SPARK-46810 and SPARK-47429. I would like to keep this PR focused on simply moving the Python error conditions into a JSON file.
@HyukjinKwon - I believe this PR is ready to go. Do you have any outstanding concerns?
Just to make sure, does it work when you install PySpark as a ZIP file? e.g., downloading it from https://spark.apache.org/downloads.html would install PySpark as a ZIP file.
I've updated the PR description with this additional ZIP test (test 4).
Just to confirm, the ZIP that gets uploaded to the site is the one under python/dist/. Is that correct?
It's python/lib/pyspark.zip once you finish building.
Hmm, so when people install this ZIP how exactly do they do it? Because it does not install cleanly like the ZIP under python/dist/.
```
$ pip install .../spark/python/lib/pyspark.zip
Processing .../spark/python/lib/pyspark.zip
ERROR: file:///.../spark/python/lib/pyspark.zip does not appear to be a Python project:
neither 'setup.py' nor 'pyproject.toml' found.
```
People can actually use PySpark directly by importing it from pyspark.zip; see https://spark.apache.org/docs/latest/api/python/getting_started/install.html?highlight=pythonpath#manually-downloading
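For context, the manual setup that page describes boils down to putting the bundled ZIPs on PYTHONPATH, roughly like this (paths and the Py4J version are illustrative):

```sh
# Illustrative; adjust SPARK_HOME and the Py4J version to your download.
export SPARK_HOME=/path/to/spark
export PYTHONPATH="$SPARK_HOME/python/lib/pyspark.zip:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
python -c "import pyspark; print(pyspark.__version__)"
```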
OK, I tested that as well and updated the PR description accordingly.
I also tweaked the syntax highlighting for that bit of documentation you linked to, because it was off. This is how it currently looks:

[screenshot of the current rendering]

Note the weird italicization and the missing `*`.
@HyukjinKwon - Anything else you'd like to see done here?
Friendly ping @HyukjinKwon.
Merged to master.