
PySpark Linting Rules

sbrugman opened this issue

Apache Spark is widely used in the Python ecosystem for distributed computing. As a user of Spark, I would like Ruff to lint problematic behaviours. The automation that Ruff offers is especially useful in projects with varying levels of software engineering skill, e.g. where people have more of a statistics background.

A PySpark style guide and a Pylint extension already exist.

I would like to start by contributing a rule that checks for repeated use of withColumn:

This method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select() with multiple columns at once.
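
For illustration, here is the kind of pattern such a rule would flag, next to the suggested fix (a rough sketch only: the column names are invented, and a SparkSession named spark is assumed):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])

# Flagged: every withColumn call adds another projection to the query plan
flagged = (
    df.withColumn("a_doubled", F.col("a") * 2)
    .withColumn("b_doubled", F.col("b") * 2)
    .withColumn("total", F.col("a") + F.col("b"))
)

# Preferred: a single select (or withColumns on Spark 3.3+) adds one projection
preferred = df.select(
    "*",
    (F.col("a") * 2).alias("a_doubled"),
    (F.col("b") * 2).alias("b_doubled"),
    (F.col("a") + F.col("b")).alias("total"),
)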

This violation seems common in existing code bases. Are you ok with a PR introducing "Spark-specific rules" (e.g. SPK)?

  • [ ] SPK001: repeated withColumn usage, use withColumns or select
  • [ ] SPK002: repeated withColumnRenamed usage, use withColumnsRenamed
  • [ ] SPK003: repeated drop usage, consolidate in single call
  • [ ] SPK004: F.date_format with simple argument, replace with specialised function (e.g. F.hour)
  • [ ] SPK005: direct access column selection (e.g. F.lower(df.col)), use implicit column selection (e.g. F.lower(F.col("col")))
  • [ ] SPK006: unnecessary F.col (in F.lower(F.col('my_column'))), use F.lower('my_column'). A short sketch of SPK004-006 follows this list.
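
A sketch of SPK004-006 as before/after pairs (the column names are invented, and a SparkSession named spark is assumed):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Ada", "2024-01-01 16:30:00")], ["name", "created_at"])

# SPK004: a simple date_format pattern has a specialised equivalent
hour_flagged = df.select(F.date_format("created_at", "HH"))
hour_preferred = df.select(F.hour("created_at"))

# SPK005/SPK006: direct attribute access or a redundant F.col around a plain name
lower_flagged = df.select(F.lower(df.name))
lower_preferred = df.select(F.lower("name"))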

Ruff already includes rules that are specific to third-party libraries: NumPy, pandas, and Airflow. Spark support would be a nice addition.

I would like to close with the following thought: supporting third-party packages may at first seem like effort spent on the long tail of possible rules to add to Ruff. Why not focus only on rules that affect all Python users? I hope that adding these will lead to helper functions that make adding new rules easier. I also think that these libraries will end up with similar API design patterns that can be linted across the ecosystem. As an example, call chaining is common to many packages that perform transformations.

sbrugman commented Sep 11 '23

I'm generally open to adding package-specific rule sets for extremely popular packages (as with Pandas, NumPy, etc.), and Spark would fit that description. However, it'd be nice to have a few rules lined up before we move forward and add any one of them. Otherwise, we run the risk that we end up with really sparse categories that only contain a rule or two.

charliermarsh commented Sep 12 '23

Super. I've updated the issue with a couple of rules that we can track.

sbrugman commented Sep 12 '23

Hi, I was looking for such a thread.

To add to the proposed list, here are some rules we wish we had at my company:

  • unnecessary drop followed by a select
  • use unionByName instead of union / unionAll (see the sketch after this list)
  • use df.writeTo(...).append() instead of df.write.insertInto(...)
  • use df.writeTo(...).overwritePartitions() instead of df.write.insertInto(..., overwrite=True)
  • replace udf with native spark functions
  • alias pyspark.sql.functions to F -> from pyspark.sql import ..., functions as F, ...
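
To make the unionByName point concrete, here is a minimal sketch of the pitfall it avoids (the data and column names are invented, and a SparkSession named spark is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([("u1", "click")], ["user_id", "event"])
right = spark.createDataFrame([("view", "u2")], ["event", "user_id"])

# union matches columns by position, so the "event" values from right
# silently end up in the "user_id" column of the result
corrupted = left.union(right)

# unionByName matches columns by name, which is almost always what was meant
correct = left.unionByName(right)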

guilhem-dvr commented Feb 09 '24

Just to add that I would be interested in this functionality.

Also, the first link in the original post is broken, and the pylint extension looks unmaintained?

  • https://github.com/palantir/pyspark-style-guide is a written PySpark guide, and contains some pylint implementations under https://github.com/palantir/pyspark-style-guide/tree/develop/src/checkers
  • https://nhsdigital.github.io/rap-community-of-practice/training_resources/pyspark/pyspark-style-guide/ is based off the guide above

amadeuspzs commented Feb 28 '24

> Super. I've updated the issue with a couple of rules that we can track. I'll kick off with SPK001-3

Did you get going with this? Thinking about jumping on it

stkrzysiak commented Jun 25 '24

Currently looking at recommending Ruff for our data team, and this feature would be great.

Rexeh commented Aug 21 '24

> Super. I've updated the issue with a couple of rules that we can track. I'll kick off with SPK001-3
>
> Did you get going with this? Thinking about jumping on it

Hey! Did you end up jumping on it? We're also considering starting it, so I'd love to hear how it went for you!

aran159 commented Sep 06 '24

I'm working on getting a first set of rules out there :)

sbrugman commented Sep 06 '24

@guilhem-dvr Thanks for your suggestions! Would you have an example for the following?

> unnecessary drop followed by a select

Also note that the import convention is already possible via: https://docs.astral.sh/ruff/settings/#lint_flake8-import-conventions_aliases

sbrugman commented Sep 06 '24

Sure, here's what I had in mind:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("John", 25, "Engineer"), ("Jane", 30, "Doctor"), ("Jim", 35, "Teacher")],
    ["Name", "Age", "Profession"],
)

# Unnecessary drop
df.drop("Age").select("Name", "Profession")

# Same statement without the drop
df.select("Name", "Profession")

But now I see that there's a pattern that shouldn't be flagged: where an 'anti' select is performed, i.e. drop some columns then select all the remaining ones with select("*"):

# Reusing the previous df schema
df.drop("Age").select("*")

Edit: this is still a bad pattern because drop already returns the whole dataframe - minus the dropped column - so you should never be chaining drop and select anyway.

> Also note that the import convention is already possible via: https://docs.astral.sh/ruff/settings/#lint_flake8-import-conventions_aliases

Lol, I had completely skimmed over the settings, thank you!

guilhem-dvr commented Sep 09 '24

Thanks for clarifying! There is not as much Spark open-source code available as there is for other libraries, so it's hard to tell how frequent this error is - but a good addition nonetheless. Funny enough, Polars uses this in their tests: https://github.com/pola-rs/polars/blob/main/py-polars/tests/unit/lazyframe/test_lazyframe.py#L1090

sbrugman commented Sep 09 '24

> Super. I've updated the issue with a couple of rules that we can track. I'll kick off with SPK001-3
>
> Did you get going with this? Thinking about jumping on it
>
> Hey! Did you end up jumping on it? We're also considering starting it, so I'd love to hear how it went for you!

No, but looks like @sbrugman has, which is exciting!

stkrzysiak commented Sep 09 '24

I'd love to contribute to this one! For PySpark style guide references, I'd suggest also considering this one from @mrpowers.

montanarograziano commented Nov 26 '24

@montanarograziano do you have specific rules in mind from that style guide?

sbrugman commented Nov 27 '24

> @montanarograziano do you have specific rules in mind from that style guide?

Apart from those already mentioned, I was thinking of something related to naming conventions. Here's what the guide suggests (a short sketch follows the list):

  • Variables pointing to DataFrames should be suffixed with df
  • Variables pointing to RDDs should be suffixed with rdd
  • with precedes transformations that add columns
  • filter precedes transformations that remove rows
  • explode precedes transformations that add rows to a DataFrame by "exploding" a row into multiple rows.
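
A rough illustration of those conventions (all names are invented; a SparkSession named spark is assumed):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sales_df = spark.createDataFrame(
    [("o1", 2, 9.99, "ok")], ["order_id", "quantity", "price", "status"]
)  # DataFrame variable suffixed with df
sales_rdd = sales_df.rdd  # RDD variable suffixed with rdd

def with_order_total(df):
    # "with" prefix: the transformation adds a column
    return df.withColumn("order_total", F.col("price") * F.col("quantity"))

def filter_cancelled(df):
    # "filter" prefix: the transformation removes rows
    return df.filter(F.col("status") != "cancelled")

def explode_line_items(df):
    # "explode" prefix: the transformation adds rows by exploding one row into many
    return df.withColumn("line_item", F.explode(F.col("line_items")))

result_df = sales_df.transform(with_order_total).transform(filter_cancelled)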

montanarograziano commented Nov 28 '24

Thanks. These are good candidates to become linting rules; however, they would need type information to reliably determine whether a variable is a DataFrame, an RDD, or a transformation. The Astral team is working on that, but it's not available yet.

sbrugman commented Nov 28 '24

I am also used to glueContext = GlueContext(SparkContext.getOrCreate()) and from pyspark.sql import functions as F, but I had to ignore N816 and N812 from pep8-naming because they flagged errors in every pull request.

lauragalera commented Dec 10 '24

pyspark.sql.functions is also aliased as sf in Apache Spark docs, so enforcing aliasing it as uppercase F doesn't seem justified.

See here: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html

magdanowak commented Mar 21 '25

Hi there, I'd be super interested in using the linting set here and was wondering what the status is. Are there already PySpark linting rules in a released version? :) Thanks a ton!

I'd love to have a warning/info rule for Spark actions, just to easily spot them in a codebase. Perhaps even a specific rule for when they're used only for a log statement.
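
For example, such a rule could flag an action that triggers a whole Spark job just to feed a log line (a minimal sketch; the logger and data are invented):

import logging

from pyspark.sql import SparkSession

logger = logging.getLogger(__name__)
spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# count() is an action: it kicks off a full job only to produce this log message
logger.info("Loaded %d rows", df.count())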

MarthinusBosman commented Jul 15 '25