AWS Glue Compatibility

Open VitorNoro opened this issue 1 year ago • 4 comments

Hello everyone, I know that AWS Glue is not in the supported platforms list, but I decided to give it a try and see if it would work. The attempt failed with an error when initializing the SparkContext. I was wondering if this is a known issue, or if anyone has managed to get this working.

Environment

  * Spark version: 3.3
  * Platform: Glue 4.0

To Reproduce
Steps to reproduce the behavior:

  1. Download jar from maven repo
  2. Upload to S3
  3. Add to job's dependent jars
  4. Set the plugin config in the SparkSession builder, or set it as a --conf property (a sketch of this step follows the list)
  5. Run the script
  6. See the error
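For reference, a minimal sketch of step 4, assuming the plugin class name io.dataflint.spark.SparkDataflintPlugin (the same package that appears in the stack trace below) and that the jar from steps 1-3 is already on the job's classpath:

    from pyspark.sql import SparkSession

    # Register DataFlint as a Spark plugin while building the session;
    # equivalent to passing --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin.
    spark = (
        SparkSession.builder
        .appName("glue-dataflint-test")  # hypothetical app name
        .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
        .getOrCreate()  # on Glue, this is the call that raises the Py4JJavaError below
    )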

Expected behavior
Session and context initialize and the job runs successfully.

Additional context
Returned error:

  File "/tmp/job.py", line 78, in <module>
    .getOrCreate()
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 269, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 491, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 197, in __init__
    self._do_init(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 282, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 410, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1585, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:529)
    at scala.None$.get(Option.scala:527)
    at org.apache.spark.dataflint.DataflintSparkUILoader$.install(DataflintSparkUILoader.scala:17)
    at io.dataflint.spark.SparkDataflintDriverPlugin.registerMetrics(SparkDataflintPlugin.scala:26)
    at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1(PluginContainer.scala:75)
    at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1$adapted(PluginContainer.scala:74)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.internal.plugin.DriverPluginContainer.registerMetrics(PluginContainer.scala:74)
    at org.apache.spark.SparkContext.$anonfun$new$41(SparkContext.scala:681)
    at org.apache.spark.SparkContext.$anonfun$new$41$adapted(SparkContext.scala:681)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:681)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)

VitorNoro · Jan 29 '24 12:01

Hi @VitorNoro! Sorry for the late response.

The issue with supporting AWS Glue with DataFlint OSS is that the Spark UI is not enabled on the cluster.

When you use the "Spark UI" in Glue, it's actually a managed history server (on which there is no way to run custom code such as DataFlint) that reads, from an S3 bucket, the events the Spark driver writes every 30 seconds.

See more at https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html
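For illustration, a hedged sketch of that flow from the job side, using the documented --enable-spark-ui and --spark-event-logs-path Glue job parameters (the job and bucket names below are hypothetical):

    import boto3

    glue = boto3.client("glue")

    # Start a run with Spark event logging enabled, so the driver writes the
    # event files that a history server (managed or self-hosted) can read.
    glue.start_job_run(
        JobName="my-glue-job",  # hypothetical job name
        Arguments={
            "--enable-spark-ui": "true",
            "--spark-event-logs-path": "s3://my-bucket/spark-events/",  # hypothetical bucket
        },
    )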

You could host a history server yourself with the DataFlint plugin installed (see instructions here: https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark-history-server) and point it at the S3 bucket with the events; see the AWS instructions here: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html. You can also initially host this history server locally on your laptop to test DataFlint out, for example as sketched below.
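A rough sketch of that local setup, assuming Spark is installed at $SPARK_HOME, the DataFlint jar is already in place per the install guide above, the S3A connector and AWS credentials are configured, and a hypothetical bucket holds the Glue event logs:

    import os
    import subprocess

    # Standard Spark history-server config: read event logs from the bucket
    # the Glue driver writes to (bucket path is hypothetical).
    env = dict(os.environ)
    env["SPARK_HISTORY_OPTS"] = (
        "-Dspark.history.fs.logDirectory=s3a://my-bucket/spark-events/"
    )

    # Launch the bundled history-server script, then browse
    # http://localhost:18080 to open DataFlint on completed Glue runs.
    subprocess.run(
        [os.path.join(os.environ["SPARK_HOME"], "sbin", "start-history-server.sh")],
        env=env,
        check=True,
    )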

Another option: I'm currently working on a SaaS offering for DataFlint that will send a summary of your Spark job to a SaaS solution with additional features (graphs of job duration, resource usage, and input size over time; recommendations; alerts; etc.). In the SaaS portal, when you select a job run you can also see its Spark UI and DataFlint UI. This offering will also support AWS Glue.

If this is something that interests you please let me know.

menishmueli · Jan 31 '24 09:01

I'm keeping this issue open until I add a better error message for trying to run DataFlint on AWS Glue.

menishmueli · Jan 31 '24 09:01

Thank you for the response! We'll consider our options, though it's more likely that we'll move away from Glue in time.

VitorNoro · Jan 31 '24 11:01

Cool! If there's anything else I can do to help, you can contact me via the DataFlint Slack community (join link in the README) or via LinkedIn (https://www.linkedin.com/in/meni-shmueli-developer/).

menishmueli · Jan 31 '24 11:01

Added an alert, "No UI detected, skipping installation", for when the UI is turned off.

menishmueli · May 19 '24 07:05