Using PySpark code instead of SQL
Right now, every DF transformation creates a new temp view and the transformation is applied as a SQL query on top of the temp view. Unfortunately, this creates a lot of state in the Spark session and makes it harder to trace the actual source of the request.
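Roughly, the current SQL path looks something like this (a simplified sketch, not the actual pyspark-ai internals; `llm_generate_sql` stands in for the real codegen step):

```python
# Register the DataFrame as a temp view, then run the LLM-generated
# SQL on top of it.
df.createOrReplaceTempView("spark_ai_temp_view")     # assumed view name
generated_sql = llm_generate_sql(prompt, df.schema)  # hypothetical helper
transformed = spark.sql(generated_sql)
```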
It would be awesome if one could choose between PySpark and SQL transformations. I've had some good success with the following prompt template.
```
Given a PySpark dataframe with the name `df` and with the following schema:

id: bigint
dropoff_zip: string
pickup_zip: string
fare_amount: double
toll_amount: double
tip: double
passenger_count: int

Write a Python function called `transform_df` that performs the following transformation and returns a new dataframe: Show only rows with more than 3 passengers.

The answer MUST contain one function only. Ensure your answer is correct.
```
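With that template, the model would ideally come back with something along these lines (an illustrative answer, not guaranteed model output):

```python
def transform_df(df):
    # Keep only rows with more than 3 passengers
    return df.filter(df.passenger_count > 3)
```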
Ideally, this could be embedded in PySpark AI like this:

```python
transformed = df.ai.transform("Show only rows with more than 3 passengers")
transformed.explain()
```

Here, `explain()` would show the full trace of the operation instead of just the read from the temp view.
@grundprinzip We have thought about this.

Using SQL is safer: with proper permission settings, the SDK can only perform SELECT statements for the transform, whereas running arbitrary Python code from an LLM may be risky.
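For illustration only, the kind of naive guard that makes SQL easier to sandbox than arbitrary Python (a sketch, not what the SDK actually does; a real implementation would parse the statement properly):

```python
def is_single_select(sql: str) -> bool:
    # Naive check: accept only a single SELECT statement.
    stmt = sql.strip().rstrip(";").strip()
    return stmt.lower().startswith("select") and ";" not in stmt
```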
> this creates a lot of state in the Spark session

There is not a lot of state: all the created temp views share the same name.

> makes it harder to trace the actual source of the request

SQL is more readable than the PySpark DataFrame APIs.
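For example, the same filter expressed both ways (illustrative only, assuming a `trips` temp view):

```python
# SQL path: the generated query reads on its own.
spark.sql("SELECT * FROM trips WHERE passenger_count > 3")

# DataFrame API path: equivalent, but interleaved with Python code.
df.filter(df.passenger_count > 3)
```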
I will evaluate and try supporting Python code generation.
I don't think it's necessarily a burning issue. I have some ideas for how one could leverage the response and the generated code so that the code can be checked in and reproduced later, for example.
For example, imagine that when you call `df.transform(prompt, cache=True)`, it actually generates the following Python code:
`main.py`:

```python
# Hashes the prompt and then, behind the scenes, dispatches to the
# generated module.
df.transform("my prompt")
```

`generated/`:

```python
def handler(df, *args, **kwargs):
    """Auto generated code"""
    return df.filter(df.col < 10)
```
Now, when you start the application the next time and run it, it will try to import the generated module first before going to OpenAI (similar to the cache today). The good thing is that you can check in the generated module and reason about it independently of other items in the cache. You can even use it on its own, without the whole module.
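A rough sketch of that dispatch logic, with hypothetical helper names (`ask_llm` and `write_module` are stand-ins for the codegen and persistence steps):

```python
import hashlib
import importlib

def transform(df, prompt, cache=True):
    # Derive a stable module name from the prompt.
    key = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    module_name = f"generated.t_{key}"
    if cache:
        try:
            # Prefer the checked-in generated module over calling the LLM.
            return importlib.import_module(module_name).handler(df)
        except ModuleNotFoundError:
            pass  # cache miss: fall through to code generation
    source = ask_llm(prompt, df.schema)            # hypothetical helper
    write_module(f"generated/t_{key}.py", source)  # hypothetical helper
    importlib.invalidate_caches()                  # pick up the new file
    return importlib.import_module(module_name).handler(df)
```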
I'll think a bit about this.