
[Python] pyarrow.fs.HadoopFileSystem cannot access Azure Data Lake (ADLS)

Open asfimport opened this issue 5 years ago • 5 comments

It's not possible to open an abfs:// or abfss:// URI with pyarrow.fs.HadoopFileSystem.

Using HadoopFileSystem.from_uri(path) does not work: libhdfs throws an error saying that the authority is invalid (I checked, and this is because the authority string is empty).

Note that the legacy pyarrow.hdfs.HadoopFileSystem interface works by doing for example:

Reporter: Juan Galvez

Note: This issue was originally created as ARROW-10872. Please see the migration documentation for further details.

asfimport avatar Dec 10 '20 15:12 asfimport

Joris Van den Bossche / @jorisvandenbossche: [~jjgalvez] thanks a lot for the report!

It's difficult for me to test whether your suggestion would work (and for other Arrow developers as well, since we often don't have a Hadoop or Azure filesystem at our disposal to test). But would you be able to try your suggestion yourself, and see if that works for you? A PR would then also be very welcome.

cc @kszucs

asfimport avatar Dec 14 '20 13:12 asfimport

Steve Loughran: this problem would also surface if file:// was used as the source URL, which may permit local replication. (Note, MiniDFSCluster is something in the hadoop-hdfs test JAR to let you bring up an HDFS cluster in process purely for testing)

asfimport avatar Jun 22 '21 15:06 asfimport

ABFS URIs take the following form: abfs://<container_name>@<account_name>.dfs.core.windows.net

It looks like the sanitisation that's done as part of the from_uri method ends up changing it to: abfs://<account_name>.dfs.core.windows.net

This can be seen in the error returned – it is missing the container name.

CC: hdfs.cc (not familiar with this codebase so I may have picked up the wrong codepath)
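To illustrate how the container name can get lost, here is a minimal Python sketch using only the standard library. It is not the Arrow code path; it only shows that a generic URI parser treats the part before the @ as user info, so rebuilding the URI from scheme and host alone drops it. The container and account names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical container and account names, for illustration only.
uri = "abfs://mycontainer@myaccount.dfs.core.windows.net/data/file.parquet"
parts = urlsplit(uri)

container = parts.username  # the part before the "@" is parsed as user info
host = parts.hostname       # the storage account endpoint

# Rebuilding from scheme and host alone silently discards the container.
lossy = f"{parts.scheme}://{host}"
print(lossy)
```

The rebuilt string has the same shape as the URI shown in the error message, which is consistent with the user info being discarded somewhere before libhdfs is called.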

A similar exception can be found using the Java client:

scala> FileSystem.get(new URI("abfs://bogus.dfs.core.windows.net"), new Configuration())
23/06/02 14:50:26 WARN fs.FileSystem: Failed to initialize fileystem abfs://bogus.dfs.core.windows.net: abfs://bogus.dfs.core.windows.net has invalid authority.
org.apache.hadoop.fs.azurebfs.contracts.exceptions.InvalidUriAuthorityException: abfs://bogus.dfs.core.windows.net has invalid authority.
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.authorityParts(AzureBlobFileSystemStore.java:334)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:202)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:195)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3452)
  at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:162)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3557)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3504)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:522)
  ... 59 elided

Interestingly, this all appears to happen before a connection to Azure is attempted so you may not need an ADLSgen2 container to validate this particular issue.
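Since the failure happens before any network access, the condition can be approximated locally. The sketch below is an assumption modelled on the observed behaviour (an authority without a container@account pair is rejected); it is not the actual Hadoop authorityParts logic, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def has_container_and_account(uri: str) -> bool:
    """Return True if the URI authority looks like <container>@<account>.

    Hypothetical helper: a rough local stand-in for the kind of check
    that fails with "has invalid authority". No connection is made.
    """
    authority = urlsplit(uri).netloc
    container, sep, account = authority.partition("@")
    return bool(sep and container and account)


# Hypothetical container name "data"; the account name is from the example above.
print(has_container_and_account("abfs://data@bogus.dfs.core.windows.net"))
print(has_container_and_account("abfs://bogus.dfs.core.windows.net"))
```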

If we include a valid authority, the FileSystem is returned:

scala> FileSystem.get(new URI("abfs://[email protected]"), new Configuration())
res0: org.apache.hadoop.fs.FileSystem = AzureBlobFileSystem{uri=abfs://[email protected], user='wdyson', primaryUserGroup='wdyson'[fs.azure.capability.readahead.safe]}

The wrapper around libhdfs should be modified to retain the container name before the @.
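As a rough sketch of that idea in Python (the real change would be in the C++ wrapper, which I have not written), the connection string could be built from the full authority rather than from the host alone. The function name and the example names are hypothetical.

```python
from urllib.parse import urlsplit


def connection_target(uri: str) -> str:
    """Build the string handed to the connect call, keeping the full authority.

    Hypothetical helper: netloc preserves "<container>@<account>" as written,
    whereas using only the hostname would drop the container.
    """
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}"


# Hypothetical container and account names, for illustration only.
print(connection_target("abfss://mycontainer@myaccount.dfs.core.windows.net/dir/file"))
```

For URIs without user info, such as a plain hdfs://host:port address, the result is unchanged, so this should not affect existing HDFS usage.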

WillDyson avatar Jun 02 '23 14:06 WillDyson

Here's the same example using libhdfs:

#include <stdio.h>
#include <stdlib.h>
#include "hdfs.h"

int main(int argc, char **argv) {
    printf("### Test with container name\n");
    hdfsConnect("abfs://[email protected]", 0);
    printf("### Test without container name\n");
    hdfsConnect("abfs://bogus.dfs.core.windows.net", 0);
}
### Test with container name
23/06/02 15:24:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
### Test without container name
23/06/02 15:24:57 WARN fs.FileSystem: Failed to initialize fileystem abfs://bogus.dfs.core.windows.net: abfs://bogus.dfs.core.windows.net has invalid authority.
hdfsBuilderConnect(forceNewInstance=0, nn=abfs://bogus.dfs.core.windows.net, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
InvalidUriAuthorityException: abfs://bogus.dfs.core.windows.net has invalid authority.abfs://bogus.dfs.core.windows.net has invalid authority.
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.authorityParts(AzureBlobFileSystemStore.java:334)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:202)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:195)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3452)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:162)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3557)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3504)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:522)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:260)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:257)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:257)

Similarly to the previous case, the behaviour is the same regardless of whether the ADLSgen2 storage account actually exists or not.

WillDyson avatar Jun 02 '23 15:06 WillDyson

Hello,

I would like to express my willingness to contribute a fix for the bug in the Delta Lake code base. I can contribute a fix for this bug independently.

Thank you for the opportunity!

Best regards, Pragy Shukla

Pshak-20000 avatar Oct 21 '24 12:10 Pshak-20000