
Refactor the DataFrame#transform method to be more elegant

Open MrPowers opened this issue 8 years ago • 8 comments

This library defines a DataFrame.transform method to chain DataFrame transformations as follows:

from pyspark.sql.functions import lit

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(df, something):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = source_df\
    .transform(lambda df: with_greeting(df))\
    .transform(lambda df: with_something(df, "crazy"))
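
For readers following along without Spark installed, here's a minimal sketch of how a `transform` method like this can work (this is an illustration, not the library's actual code — a plain class stands in for a Spark DataFrame):

```python
# Sketch of a chainable transform method: apply a function to self
# and return the result, so calls can be chained fluently.
class Frame:
    def __init__(self, columns):
        self.columns = dict(columns)

    def withColumn(self, name, value):
        # return a new Frame with the extra column, like Spark does
        return Frame({**self.columns, name: value})

    def transform(self, f):
        # f takes this frame and returns a new frame
        return f(self)

df = Frame({"name": "jose"})
result = df.transform(lambda d: d.withColumn("greeting", "hi"))
print(result.columns)  # {'name': 'jose', 'greeting': 'hi'}
```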

The Spark Scala API has a built-in transform method that lets users chain DataFrame transformations more elegantly, as described in this blog post.

Here's an interface I'd prefer (this is what we do in Scala and I know this will need to be changed around for Python, but I'd like something like this):

def with_greeting()(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something)(df):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

# the transform method magically knows that self should be passed
# into the second parameter list
actual_df = source_df\
    .transform(with_greeting())\
    .transform(with_something("crazy"))

Here is the code that needs to be changed.

If we can figure out a better interface, we should consider making a pull request to the Spark source code. I use the transform method every day when writing Spark/Scala code, and its absence is a major omission in the PySpark API.

If my ideal interface isn't possible, is there anything better?! I really don't like that my current solution requires a lambda.

@pirate - help!

MrPowers avatar Oct 31 '17 17:10 MrPowers

Try a closure:

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something):
    def partial(df):
        return df.withColumn("something", lit(something))
    return partial

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)   # no lambda required
    .transform(with_something("crazy")))
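
To see the closure pattern in isolation, here's a plain-Python sketch (dicts stand in for DataFrames, so it runs without Spark):

```python
# Closure pattern: the outer call fixes `something` up front,
# and the inner function receives the data later.
def with_something(something):
    def apply(row):
        # `apply` closes over `something` from the enclosing scope
        return {**row, "something": something}
    return apply

add_crazy = with_something("crazy")
print(add_crazy({"name": "li"}))  # {'name': 'li', 'something': 'crazy'}
```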

In JS this looks like:

const myFunc = (first_set_of_args) => (second_set_of_args) => {
    ...function body
}

pirate avatar Oct 31 '17 18:10 pirate

functools.partial actually works for this too, although I think the closure method is cleaner/easier to understand:

from functools import partial


def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something, df):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_something, "crazy")))
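
The same idea in plain Python, without Spark: `partial` fixes the leading argument ahead of time, leaving a one-argument function suitable for transform-style chaining.

```python
from functools import partial

# partial pre-binds `something`, producing a function of one argument.
def with_something(something, row):
    return {**row, "something": something}

add_crazy = partial(with_something, "crazy")
print(add_crazy({"name": "li"}))  # {'name': 'li', 'something': 'crazy'}
```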

pirate avatar Nov 01 '17 16:11 pirate

Thanks @pirate.

I updated the test suite (https://github.com/MrPowers/quinn/commit/283629acd2a5544e53ad55ebbf92c77174bef799) to demonstrate how functools.partial can be used. I also changed the string "luisa" to "liz" based on a code review from @lizparody 😉

I also updated the blog post (https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55) to include a functools.partial example.

Thanks!

MrPowers avatar Nov 01 '17 17:11 MrPowers

AAHAHHAHAHAHAHAHAHAHAHAHAHAHAH 😂 it was a joke!!!

LizzParody avatar Nov 01 '17 17:11 LizzParody

FYI @MrPowers, you don't need partial on the first transform func, for the same reason that you don't need a lambda there:

lambda x: func(x) == partial(func) == func

actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_jacket, "warm")))
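
That equivalence is easy to check directly in plain Python:

```python
from functools import partial

def func(x):
    return x + 1

# For a one-argument function, wrapping it in a lambda or in a partial
# with no bound arguments changes nothing about the call.
assert func(1) == (lambda x: func(x))(1) == partial(func)(1) == 2
```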

I also recommend using the word "closure" or "higher-order function" somewhere in your blog post, as those are the "standard" names instead of "nested function".

Great blog post though, nice work!

pirate avatar Nov 01 '17 17:11 pirate

Thanks @pirate - I updated the code and blog post accordingly.

Thanks for all the help here - I really appreciate the feedback. Feel free to rip up my code or blog posts anytime!!!

MrPowers avatar Nov 01 '17 18:11 MrPowers

@pirate - @capdevc showed me how to use cytoolz to run multiple custom DataFrame transformations with function composition. Take a look at this commit.

Thanks @capdevc!!!

MrPowers avatar Nov 06 '17 04:11 MrPowers

@MrPowers I really like cytoolz and use it a lot, but it's a pretty heavy dependency to pull in for just the curry decorator. curry and compose are also available in toolz, which is the same as cytoolz minus the Cython bits, which shouldn't matter in this application. You could also just add your own curry decorator, since it's just a wrapper around functools.partial.
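
A hand-rolled curry along those lines might look like this (a sketch, not toolz's actual implementation — it ignores default arguments and other edge cases toolz handles):

```python
import inspect
from functools import partial

# Minimal curry sketch on top of functools.partial: if not enough
# arguments have arrived yet, return a new curried partial;
# otherwise call the function.
def curry(f):
    def curried(*args, **kwargs):
        expected = len(inspect.signature(f).parameters)
        if len(args) + len(kwargs) >= expected:
            return f(*args, **kwargs)
        return curry(partial(f, *args, **kwargs))
    return curried

@curry
def with_something(something, row):
    return {**row, "something": something}

# Works both partially applied and called all at once:
print(with_something("crazy")({"name": "li"}))  # {'name': 'li', 'something': 'crazy'}
```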

capdevc avatar Nov 06 '17 13:11 capdevc

Closing this now that DataFrame#transform has been included in PySpark. Really appreciate everyone's help.

MrPowers avatar Sep 27 '23 00:09 MrPowers