enterprise_gateway
Can we read/write to HDFS/Hive from the Spark Scala kernel?
Hi, we are trying to read from and write to HDFS/Hive from the Spark Scala kernel, and it is failing. Are read/write operations on HDFS/Hive supported from the Spark Scala kernel? If so, can you please let us know how?
In principle, you should be able to perform the same Spark operations in the Scala kernel that you can perform in a regular Spark application. One caveat: make sure the kernel runs as the user that has access to HDFS/Hive, because depending on how you are using the kernel you might be running as a "default" or "generic" user.
With Jupyter Enterprise Gateway, you can run Jupyter kernels in cluster mode and use "user impersonation" via Kerberos (this is what we usually see in production environments, and it is the recommended security setting). In this case, the kernel runs as the user who requested the new notebook instance, which avoids any security mismatches.
Having said all that, you mention that you are having issues but don't say what kind, so my comments above cover only high-level capabilities and recommendations and may not apply to your specific problem.
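To check the caveat above about which user the kernel is running as, a cell along these lines can print the identity that Spark and Hadoop see on the driver. This is only a sketch: it assumes a Spark Scala kernel in which `sc` (the SparkContext) is already defined and the Hadoop client classes are on the classpath.

```scala
import org.apache.hadoop.security.UserGroupInformation

// Assumption: `sc` is the SparkContext provided by the Spark Scala kernel.
// UserGroupInformation reports the identity Hadoop uses for HDFS access.
val ugi = UserGroupInformation.getCurrentUser

println(s"Spark user:           ${sc.sparkUser}")
println(s"Hadoop user:          ${ugi.getShortUserName}")
println(s"Authentication:       ${ugi.getAuthenticationMethod}")
println(s"Kerberos credentials: ${ugi.hasKerberosCredentials}")
```

If the printed user is a generic account rather than the notebook user, permission problems on HDFS/Hive paths are a likely suspect.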
@suryag10 We are able to do this from a Python kernel. We have been running it for a year without any issues. It would help if you can share the exact issues that you are facing.
Thanks lresende/nareshsankapelly. The following is an example:
val textFile = sc.textFile("hdfs://192.16.1.1:8020/tmp/scalakernelcreatereadfile.txt")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
print(counts.count())
The output is not displayed in the notebook; the cell remains in a hung state and no new cells are executed.
We are using Spark 2.4.5 with Kubernetes (k8s) as the resource manager.
Note that direct HDFS read/write from the Spark Scala kernel is working; it is only the Spark read/write to HDFS that times out/hangs.
Hi, any thoughts or suggestions on this?
This is a bit harder to answer; maybe the logs could give us some hints. Do the EG, Spark, or Hadoop logs show any related issues?
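While collecting the logs, it may also help to narrow down whether the hang is specific to HDFS or affects any job that needs executors. The cell below is a rough sketch, not a definitive diagnosis: it assumes `sc` is provided by the kernel, the helper name `timed` and the 60-second limit are made up for illustration, and the HDFS path is the one from the example above.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical helper: run a Spark action with a deadline so that a stuck
// job does not block the notebook cell indefinitely.
def timed[T](label: String, limit: FiniteDuration)(action: => T): Try[T] = {
  val result = Try(Await.result(Future(action), limit))
  // On timeout or error, cancel whatever is still running on the cluster.
  if (result.isFailure) sc.cancelAllJobs()
  println(s"$label -> $result")
  result
}

// 1. No HDFS involved: only checks that executors start and return results.
val smoke = timed("parallelize", 60.seconds) {
  sc.parallelize(1 to 100).count()
}

// 2. Same kind of action, but the executors have to read from HDFS.
val hdfsRead = timed("textFile", 60.seconds) {
  sc.textFile("hdfs://192.16.1.1:8020/tmp/scalakernelcreatereadfile.txt").count()
}
```

If the first action also times out, the problem is probably in how executors are scheduled or reach the driver on k8s rather than in HDFS access; if only the second one fails, the executor pods' connectivity to the NameNode/DataNodes would be the next thing to look at.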