
Using Spark to write a DataFrame to Hudi produces ERROR DatahubSparkListener: java.lang.NullPointerException

Open CaesarWangX opened this issue 2 years ago • 1 comments

We want to get the lineage of our Spark job.

Our environment is EMR, with Spark 3.1.2, Hudi 0.8.0, and DataHub 0.8.45.

Our Spark job processes data from the read area and then writes it to Hudi. As a result, only a pipeline shows up in DataHub, and no lineage graph is produced.
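For context, a job of the kind described might be wired up roughly as follows. This is only a minimal sketch and not the reporter's actual code: the configuration keys are the ones documented for the DataHub Spark lineage agent and for Hudi, but the artifact version (assumed here to match DataHub 0.8.45), the server URL, the table name, and the field names are all hypothetical placeholders.

```python
# Sketch: a PySpark job that writes to Hudi with the DataHub listener attached.
# Values marked "hypothetical" are placeholders, not taken from the report.

DATAHUB_SPARK_CONF = {
    # Assumption: the lineage agent version matches the DataHub version (0.8.45).
    "spark.jars.packages": "io.acryl:datahub-spark-lineage:0.8.45",
    # Registers the listener that appears in the stack trace.
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    # Hypothetical DataHub GMS endpoint.
    "spark.datahub.rest.server": "http://localhost:8080",
}

HUDI_WRITE_OPTIONS = {
    "hoodie.table.name": "example_table",                    # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",         # hypothetical
    "hoodie.datasource.write.precombine.field": "ts",        # hypothetical
}


def build_session(app_name="hudi-lineage-example"):
    """Create a SparkSession with the DataHub listener configured."""
    # Imported lazily so the configuration above can be inspected without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DATAHUB_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def write_to_hudi(df, path):
    """Append a DataFrame to a Hudi table at the given path."""
    # Assumes the Hudi bundle is already on the classpath (as on EMR).
    df.write.format("hudi").options(**HUDI_WRITE_OPTIONS).mode("append").save(path)
```

With a setup like this, the listener is invoked on every SQL execution event, which is where the error below is raised.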

And we found the following errors in the job log:

22/10/21 07:05:24 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/10/21 07:05:24 ERROR DatahubSparkListener: java.lang.NullPointerException
    at datahub.spark.DatasetExtractor.lambda$static$6(DatasetExtractor.java:143)
    at datahub.spark.DatasetExtractor.asDataset(DatasetExtractor.java:228)
    at datahub.spark.DatahubSparkListener$SqlStartTask.run(DatahubSparkListener.java:114)
    at datahub.spark.DatahubSparkListener.processExecution(DatahubSparkListener.java:350)
    at datahub.spark.DatahubSparkListener.onOtherEvent(DatahubSparkListener.java:262)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
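When a trace like this arrives flattened onto one line, it can help to split it back into frames and look at the first one inside the `datahub.spark` package. The helper below is a minimal sketch for that purpose; the function names and the regular expression are illustrative and are not part of DataHub or Spark.

```python
import re

# Matches frames of the form "at <qualified.method>(<File.ext>:<line>)".
FRAME_RE = re.compile(r"\bat ([\w.$]+)\(([\w$]+\.\w+):(\d+)\)")


def parse_frames(log_text):
    """Return (method, file, line) tuples for each stack frame found in the text."""
    return [(method, file, int(line)) for method, file, line in FRAME_RE.findall(log_text)]


def first_frame_in(frames, package_prefix):
    """Return the first frame whose method starts with the given package prefix, or None."""
    for frame in frames:
        if frame[0].startswith(package_prefix):
            return frame
    return None


if __name__ == "__main__":
    sample = (
        "22/10/21 07:05:24 ERROR DatahubSparkListener: java.lang.NullPointerException "
        "at datahub.spark.DatasetExtractor.lambda$static$6(DatasetExtractor.java:143) "
        "at datahub.spark.DatasetExtractor.asDataset(DatasetExtractor.java:228)"
    )
    print(first_frame_in(parse_frames(sample), "datahub.spark"))
```

Applied to the log above, the first `datahub.spark` frame is the one in `DatasetExtractor.java` at line 143, which is where the exception is thrown.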

CaesarWangX, Oct 21 '22 07:10

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot], Nov 21 '22 02:11

Hi @CaesarWangX this seems like a troubleshooting issue, rather than a bug. We're happy to provide community support on our Slack channel, but currently reserve git issues for bugs.

If you're still having trouble, please join us at [slack.datahubproject.io](https://slack.datahubproject.io) and we can troubleshoot there. For now, I'm going to close this issue.

laulpogan, Dec 07 '22 20:12