bdutil
Spark eventlog directory points to GCS even if default_fs is set to hdfs
Right now spark.eventLog.dir gets set to a GCS path regardless of the DEFAULT_FS chosen for the deployment; this means that if a deployment intentionally disables GCS accessibility, e.g. by removing external IP addresses, then even an HDFS-only setup doesn't work for Spark.
The temporary workaround is to manually edit spark.eventLog.dir in /home/hadoop/spark-install/conf/spark-defaults.conf on the master to something like hdfs:///spark-eventlog-base and run hadoop fs -mkdir -p hdfs:///spark-eventlog-base, or to set spark.eventLog.enabled to false; a sketch of those steps follows.
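For reference, here is a rough sketch of that manual workaround, run on the master. It assumes spark-defaults.conf lives at the path above; if spark.eventLog.dir isn't already present in the file, the tee line simply appends it:

```sh
# Point the Spark eventlog at HDFS instead of GCS (run on the master).
SPARK_CONF=/home/hadoop/spark-install/conf/spark-defaults.conf

# Drop any existing spark.eventLog.dir entry, then append the HDFS path.
sudo sed -i '/^spark\.eventLog\.dir/d' "${SPARK_CONF}"
echo "spark.eventLog.dir hdfs:///spark-eventlog-base" | sudo tee -a "${SPARK_CONF}"

# Create the eventlog directory on HDFS so Spark can write to it.
hadoop fs -mkdir -p hdfs:///spark-eventlog-base

# Alternatively, disable event logging entirely:
# echo "spark.eventLog.enabled false" | sudo tee -a "${SPARK_CONF}"
```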
We can fix this by automatically deriving the right path from the default filesystem. Unfortunately Spark doesn't appear to correctly pick up fs.default.name for schemeless paths, possibly because of classloading ordering issues that cause the path to be resolved before the default core-site.xml has been loaded; a schemeless setting ends up with something like:
java.lang.IllegalArgumentException: Log directory file:/spark-eventlog-base/dhuo-noip-m does not exist.
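So the generated spark-defaults.conf should carry a fully-qualified URI rather than relying on Spark to resolve a schemeless path. A minimal sketch of what that could look like in the bdutil setup script that writes the Spark config, assuming a DEFAULT_FS variable of 'gs' or 'hdfs' and a CONFIGBUCKET variable for the GCS case (variable names here are illustrative and may not match the actual bdutil scripts):

```sh
# Sketch: derive the eventlog base from the deployment's default filesystem.
if [[ "${DEFAULT_FS}" == 'gs' ]]; then
  SPARK_EVENTLOG_DIR="gs://${CONFIGBUCKET}/spark-eventlog-base"
else
  SPARK_EVENTLOG_DIR='hdfs:///spark-eventlog-base'
  # HDFS paths must exist before Spark tries to write event logs to them.
  hadoop fs -mkdir -p "${SPARK_EVENTLOG_DIR}"
fi

# Write a fully-qualified URI so Spark never resolves a schemeless path
# against the local filesystem before core-site.xml is loaded.
cat >> /home/hadoop/spark-install/conf/spark-defaults.conf <<EOF
spark.eventLog.dir ${SPARK_EVENTLOG_DIR}
EOF
```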