Elly.jl
Support for wasb:// protocol on Azure HDInsight
I can see the files using hadoop fs -ls but not using readdir. Creating a file reference with HDFSFile for a file I know exists and then calling stat on it throws Elly.HDFSException("Path not found").
sshuser@hn0-myclust:~$ hadoop fs -ls /
Found 15 items
drwxr-xr-x - root supergroup 0 2018-02-07 14:25 /HdiSamples
drwxr-xr-x - hdfs supergroup 0 2018-02-07 14:15 /ams
drwxr-xr-x - hdfs supergroup 0 2018-02-07 14:15 /amshbase
drwxrwxrwx - yarn hadoop 0 2018-02-07 14:15 /app-logs
drwxr-xr-x - hdfs supergroup 0 2018-02-07 14:15 /apps
drwxr-xr-x - yarn hadoop 0 2018-02-07 14:15 /atshistory
drwxr-xr-x - root supergroup 0 2018-02-07 14:24 /custom-scriptaction-logs
drwxr-xr-x - root supergroup 0 2018-02-07 14:25 /example
drwxr-xr-x - hbase supergroup 0 2018-02-07 14:15 /hbase
drwxr-xr-x - hdfs supergroup 0 2018-02-07 14:15 /hdp
drwxr-xr-x - hdfs supergroup 0 2018-02-07 14:15 /hive
drwxr-xr-x - mapred supergroup 0 2018-02-07 14:15 /mapred
drwxrwxrwx - mapred hadoop 0 2018-02-07 14:15 /mr-history
drwxrwxrwx - hdfs supergroup 0 2018-02-07 14:15 /tmp
drwxr-xr-x - hdfs supergroup 0 2018-02-07 14:15 /user
sshuser@hn0-myclust:~$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu
julia> using Elly
julia> dfs = HDFSClient("hn0-myclust.3p0iyjauoc2e3faws152r5tm0e.cx.internal.cloudapp.net", 8020)
HDFSClient: sshuser@hn0-myclust.3p0iyjauoc2e3faws152r5tm0e.cx.internal.cloudapp.net:8020/
id: 76ba6c80-1ac9-45
connected: false
pwd: /
julia> readdir(dfs)
1-element Array{AbstractString,1}:
"tmp"
[Renamed the issue]
This happens because Azure uses a separate wasb:// protocol layered over hdfs://, with Azure Blob Storage as the underlying storage. This will probably need to be supported explicitly within Elly.
Some background: https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/
Similarly, HDInsight supports the adl:// protocol, which uses Azure Data Lake Store as the underlying storage engine for Hadoop. It would be good to support that as well.
related:
- https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage
- https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
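For reference, a wasb location is normally written as wasb[s]://<container>@<account>.blob.core.windows.net/<path>. Any explicit support in Elly would first have to split such a URI into its parts. The following is only a minimal sketch of that step; `WasbLocation` and `parse_wasb` are hypothetical names and are not part of Elly.jl.

```julia
# Hypothetical helper, not part of Elly.jl: the parts of a wasb:// or wasbs:// URI.
struct WasbLocation
    secure::Bool       # true for wasbs:// (TLS), false for wasb://
    container::String  # blob container name (the part before '@')
    host::String       # storage account host, e.g. <account>.blob.core.windows.net
    path::String       # path inside the container, always starting with '/'
end

# Split a URI of the form wasb[s]://<container>@<host>/<path>.
function parse_wasb(uri::AbstractString)
    m = match(r"^(wasbs?)://([^@/]+)@([^/]+)(/.*)?$", uri)
    m === nothing && throw(ArgumentError("not a wasb:// URI: $uri"))
    scheme, container, host, path = m.captures
    # A missing path component refers to the container root.
    WasbLocation(scheme == "wasbs", container, host, path === nothing ? "/" : path)
end
```

For example, `parse_wasb("wasb://data@myaccount.blob.core.windows.net/example/file.txt")` yields a location with container "data" and path "/example/file.txt" (the account and container names here are made up).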
Looks like this wasb support came in with Hadoop v2.9: https://hadoop.apache.org/docs/r2.9.0/hadoop-azure/index.html#Introduction
What is not yet clear to me is whether the server transparently wraps wasb and presents an hdfs interface. If it does, we should be able to access wasb just by upgrading Elly to use the v2.9 protobuf APIs. But I am still unsure how or why that would work. Will dig a bit deeper.
This looks to be implemented entirely as a client library; see the org/apache/hadoop/fs/azure/NativeAzureFileSystem.html source.
It seems to read the hdfs config, but it interacts with Azure services directly. The hdfs namenode and datanodes do not seem to be aware of it at all.
So the HDFSFile implementation in Elly.jl can cater only to the hdfs:// filesystem. We probably need to look at the Azure APIs and implement a NativeAzureFile along similar lines in Julia. There also doesn't seem to be any direct Azure API for the wasb filesystem protocol, only APIs for the blob store, so we will need to implement the filesystem metadata management in Julia as well.
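To give a rough idea of what that metadata management involves: a blob store exposes a flat list of blob names, so a readdir-style call has to derive directory entries from name prefixes. The sketch below assumes blob names are plain "/"-separated strings with no leading slash; `list_dir` is a hypothetical name, and how hadoop-azure itself records empty directories, permissions, and ownership is not modelled here.

```julia
# Hypothetical sketch: derive readdir-style entries from a flat list of blob names.
function list_dir(blob_names, dir::AbstractString)
    # Normalise the directory to a prefix with a trailing slash and no leading one.
    prefix = endswith(dir, "/") ? dir : dir * "/"
    prefix = lstrip(prefix, '/')
    entries = Set{String}()
    for name in blob_names
        startswith(name, prefix) || continue
        # Everything after the prefix; its first segment is the direct child.
        rest = SubString(name, ncodeunits(prefix) + 1)
        isempty(rest) && continue
        child = first(split(rest, '/'))
        isempty(child) && continue
        push!(entries, String(child))
    end
    sort!(collect(entries))
end
```

With blobs named "tmp/a.txt" and "example/data/x.csv", listing "/" would give "example" and "tmp", even though no directory objects exist in the store. A real implementation would page through the Azure blob listing API rather than hold all names in memory.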