AWS Glue Compatibility
Hello everyone, I know that AWS Glue is not in the supported platforms list, but I decided to give it a try and see if it would work. This attempt failed, resulting in an error when initializing the Spark Context. I was wondering if this is a known issue, or if anyone managed to get this working.
Environment
- spark version: 3.3
- platform: Glue 4.0
To Reproduce
Steps to reproduce the behavior:
- Download jar from maven repo
- Upload to S3
- Add to job's dependent jars
- Set plugin config in SparkSession builder (or set as --conf property)
- Run the script
- See the error
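For reference, the jar and plugin settings from the steps above can be sketched roughly as follows. This is a minimal sketch, not my exact job: the S3 path is a hypothetical placeholder, the plugin class name is the one I understand the DataFlint README to use (please verify it against the version you download), and `--extra-jars` / `--conf` are the Glue job parameters I assume are used for dependent jars and Spark properties.

```python
# Sketch of the Glue job parameters for the repro steps above.
# ASSUMPTIONS: the S3 path is a made-up placeholder; the plugin class name
# should be checked against the DataFlint README for your version.
DATAFLINT_JAR_S3_PATH = "s3://my-bucket/jars/dataflint-spark.jar"  # hypothetical
DATAFLINT_PLUGIN_CLASS = "io.dataflint.spark.SparkDataflintPlugin"  # verify


def plugin_spark_conf(plugin_class: str) -> dict:
    """Spark properties that register a driver plugin via spark.plugins."""
    return {"spark.plugins": plugin_class}


def glue_default_arguments(jar_s3_path: str, plugin_class: str) -> dict:
    """Build Glue job parameters: the dependent jar plus the plugin as --conf."""
    conf = plugin_spark_conf(plugin_class)
    conf_value = " --conf ".join(f"{key}={value}" for key, value in conf.items())
    return {
        "--extra-jars": jar_s3_path,  # "Dependent JARs path" in the console
        "--conf": conf_value,
    }


if __name__ == "__main__":
    for name, value in glue_default_arguments(
        DATAFLINT_JAR_S3_PATH, DATAFLINT_PLUGIN_CLASS
    ).items():
        print(name, value)
```

The same `spark.plugins` key/value pair can instead be passed to `SparkSession.builder.config(...)` inside the job script, which is the other variant mentioned in the steps.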
Expected behavior
Session and context initialized and job running successfully.
Additional context
Returned error:
File "/tmp/job.py", line 78, in
Hi @VitorNoro! Sorry for the long response.
The issue with supporting AWS Glue with DataFlint OSS is that the Spark UI is not enabled on the cluster.
What Glue calls the "Spark UI" is actually a managed history server, and there is no way to run custom code such as DataFlint on it. It reads events from an S3 bucket that the Spark driver writes to every 30 seconds.
See more at https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html
You could host a history server yourself with the DataFlint plugin installed (see instructions here: https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark-history-server) and point it to the S3 bucket with the events. See instructions here: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html. You can also start by hosting this history server locally on your laptop to test DataFlint out.
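As a rough illustration of "pointing it to the S3 bucket", the sketch below renders the standard Spark history server properties into `spark-defaults.conf` form. The bucket name and prefix are hypothetical, and I'm assuming the `s3a://` connector (which needs the hadoop-aws jars and AWS credentials on the history server). The DataFlint-specific settings are deliberately left out; take those from the install instructions linked above.

```python
# Sketch: generate spark-defaults.conf lines for a self-hosted history server
# that reads Glue's event logs from S3.
# ASSUMPTIONS: bucket/prefix are made-up; s3a:// requires hadoop-aws and AWS
# credentials; DataFlint's own settings are NOT included here.


def history_server_conf(event_log_bucket: str, prefix: str = "") -> dict:
    """Standard Spark properties for a filesystem-backed history server."""
    log_dir = f"s3a://{event_log_bucket}/{prefix.strip('/')}".rstrip("/")
    return {
        "spark.history.provider": "org.apache.spark.deploy.history.FsHistoryProvider",
        "spark.history.fs.logDirectory": log_dir,
    }


def render_spark_defaults(conf: dict) -> str:
    """Render properties as 'key value' lines, the spark-defaults.conf format."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items())) + "\n"


if __name__ == "__main__":
    # Hypothetical bucket and prefix where the Glue job writes its event logs.
    print(render_spark_defaults(history_server_conf("my-glue-logs", "sparkui/")), end="")
```

The output can be appended to the history server's `conf/spark-defaults.conf` before starting it with `sbin/start-history-server.sh`.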
Another option: I'm currently working on a SaaS offering for DataFlint that will send a summary of your Spark job to a hosted service with additional features (graphs of job duration, resource usage and input size over time, recommendations, alerts, etc.). In this SaaS portal, when you select a job run you will also be able to see its Spark UI and DataFlint UI. This offering will also support AWS Glue.
If this is something that interests you please let me know.
I'm keeping this issue open until I add a better error message for when someone tries to run DataFlint on AWS Glue.
Thank you for the response! We'll consider our options, though it's likelier we move away from Glue in time.
Cool! If there is anything else I can do to help you, you can contact me via the DataFlint Slack community (join link in the README) or via LinkedIn (https://www.linkedin.com/in/meni-shmueli-developer/)
Added an alert "No UI detected, skipping installation" if UI is turned off
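To illustrate the idea behind this change, here is a small Python sketch of such a guard. It is not the actual DataFlint code (the plugin itself runs on the JVM): the function name is hypothetical, and checking the standard `spark.ui.enabled` property is my assumption about how "UI is turned off" could be detected.

```python
import logging

logger = logging.getLogger("dataflint_sketch")  # hypothetical logger name


def should_install_ui(spark_conf: dict) -> bool:
    """Return False (and log the alert) when the Spark UI is disabled.

    ASSUMPTION: the UI state is read from the standard `spark.ui.enabled`
    property, which Spark treats as true when unset.
    """
    ui_enabled = str(spark_conf.get("spark.ui.enabled", "true")).lower() == "true"
    if not ui_enabled:
        logger.warning("No UI detected, skipping installation")
        return False
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(should_install_ui({"spark.ui.enabled": "false"}))
    print(should_install_ui({}))
```

With a guard like this, the job continues without the DataFlint UI instead of failing during Spark context initialization.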