Writing a DataFrame to Hudi with Spark fails with `ERROR DatahubSparkListener: java.lang.NullPointerException`
We want to capture the lineage of our Spark job.

Our environment is EMR, with Spark 3.1.2, Hudi 0.8.0, and DataHub 0.8.45.

The Spark job reads data from the raw area, processes it, and writes the result to Hudi. However, we only get a pipeline entity on DataHub, with no lineage attached to it.

We also found the following errors in the job log:
```
22/10/21 07:05:24 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/10/21 07:05:24 ERROR DatahubSparkListener: java.lang.NullPointerException
	at datahub.spark.DatasetExtractor.lambda$static$6(DatasetExtractor.java:143)
	at datahub.spark.DatasetExtractor.asDataset(DatasetExtractor.java:228)
	at datahub.spark.DatahubSparkListener$SqlStartTask.run(DatahubSparkListener.java:114)
	at datahub.spark.DatahubSparkListener.processExecution(DatahubSparkListener.java:350)
	at datahub.spark.DatahubSparkListener.onOtherEvent(DatahubSparkListener.java:262)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
```
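For context, the DataHub Spark lineage listener is normally enabled through Spark configuration at submit time. The sketch below is not the job from this report; the GMS server URL, script name, and host are placeholders, and only the package coordinate and config keys come from the standard DataHub setup:

```shell
# Hypothetical spark-submit invocation enabling the DataHub lineage listener.
# <gms-host> and your_hudi_job.py are placeholders for this environment.
spark-submit \
  --packages io.acryl:datahub-spark-lineage:0.8.45 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://<gms-host>:8080" \
  your_hudi_job.py
```

The NPE above is thrown inside `DatasetExtractor` while the listener tries to map the query plan's output node to a dataset, so the job itself still succeeds; only the lineage emission fails.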
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io
Hi @CaesarWangX, this seems like a troubleshooting issue rather than a bug. We're happy to provide community support on our Slack channel, but we currently reserve GitHub issues for bugs.
If you're still having trouble, please join us at [slack.datahubproject.io](https://slack.datahubproject.io) and we can troubleshoot there. For now, I'm going to close this issue.