
Can we read/write to HDFS/Hive from the Spark Scala kernel?

suryag10 opened this issue Apr 16 '20 · 6 comments

Hi, we are trying to read/write to HDFS/Hive from the Spark Scala kernel and failing to do so. Are read/write operations to HDFS/Hive from the Spark Scala kernel supported? If so, can you please let us know?

suryag10 · Apr 16 '20

In principle, you should be able to perform the same Spark operations in the Scala kernel that you can in a regular Spark application. One caveat is that you need to make sure you are the expected user with access to HDFS/Hive, as depending on how you are using the kernel you might be running as a "default" or "generic" user.
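As a quick sanity check, you can print the effective identity from inside the Scala kernel. A minimal sketch, assuming the kernel provides the usual sc SparkContext binding:

    // Which user does Spark submit work as, and which OS user owns the kernel process?
    println(sc.sparkUser)
    println(System.getProperty("user.name"))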

With Jupyter Enterprise Gateway, we have the ability to run Jupyter kernels in cluster mode, where you can use "user impersonation" via Kerberos (which is usually what we see in production environments and is the recommended security setting). In this case, the kernel runs as the user requesting the new notebook instance, which avoids any security mismatches.
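When impersonation is in effect, the Hadoop-level identity should be the notebook user rather than a shared service account. A minimal sketch to verify this from the kernel, using Hadoop's UserGroupInformation API:

    import org.apache.hadoop.security.UserGroupInformation

    // Under Kerberos impersonation this should print the notebook user's
    // name, not a shared service account.
    val ugi = UserGroupInformation.getCurrentUser
    println(s"user: ${ugi.getShortUserName}, auth: ${ugi.getAuthenticationMethod}")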

Having said all that, you mention that you are having issues but don't provide much information about what kind, so my thoughts above are mostly high-level capabilities and recommendations and might not apply to your particular problem.

lresende · Apr 16 '20

@suryag10 We are able to do this from a Python kernel and have been running it for a year without any issues. It would help if you could share the exact issues you are facing.
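For reference, the Scala-kernel equivalent of a basic Hive round trip might look like the sketch below, assuming the kernel exposes a Hive-enabled SparkSession as spark (the table name here is hypothetical):

    // Hypothetical table name; replace with your own.
    spark.sql("SHOW DATABASES").show()
    spark.sql("CREATE TABLE IF NOT EXISTS default.eg_smoke_test (id INT)")
    spark.sql("SELECT COUNT(*) FROM default.eg_smoke_test").show()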

nareshsankapelly · Apr 17 '20

Thanks @lresende / @nareshsankapelly. Following is the example:

    val textFile = sc.textFile("hdfs://192.16.1.1:8020/tmp/scalakernelcreatereadfile.txt")
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    print(counts.count())

The output is not displayed in the notebook, the cell remains in a hung state, and no new cells are executed.

We are using Spark 2.4.5 with Kubernetes as the resource manager.
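One thing we can check from the kernel is whether executors actually registered with the driver, since on Spark-on-Kubernetes a hang on the first action often means they never did, or cannot reach the driver over the network. A minimal sketch, again assuming the usual sc binding:

    // If only the driver's own entry appears here, no executors have
    // registered, and any action will block waiting for tasks to run.
    println(sc.getExecutorMemoryStatus.keys.mkString("\n"))
    // The driver address that executors must be able to reach back to.
    println(sc.getConf.get("spark.driver.host", "unset"))
    println(sc.getConf.get("spark.driver.port", "unset"))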

suryag10 · Apr 18 '20

Note that direct HDFS read/write from the Spark Scala kernel is working; it is only Spark read/write to HDFS that times out/hangs.
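That distinction matters: direct HDFS access runs entirely on the driver (the kernel pod), while a Spark action ships tasks to executors, which must themselves reach both HDFS and the driver. A minimal sketch of the two paths, reusing the file from the example above:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Driver-only path: what "direct hdfs read/write" exercises.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    println(fs.exists(new Path("/tmp/scalakernelcreatereadfile.txt")))

    // Executor path: count() is an action, so tasks run on executors;
    // this is the path that hangs here.
    // sc.textFile("hdfs://192.16.1.1:8020/tmp/scalakernelcreatereadfile.txt").count()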

suryag10 · Apr 20 '20

Hi, any thoughts or suggestions on this?

suryag10 · Apr 22 '20

This is a bit harder to respond to; maybe the logs could give us some hints. Do the EG, Spark, and Hadoop logs show any related issues?

lresende · Apr 22 '20