S3Proxy support for Azure Data Lake Storage Gen2
Hi all, we are using Apache Spark with S3Proxy to read and write data via the s3a:// API from Azure Data Lake Storage Gen2. Reading through the S3Proxy documentation, I can see that S3Proxy supports Azure Blob Storage for Apache Spark. However, Azure has one more flavor of storage, optimized for big data: Azure Data Lake Storage Gen2, and there is no information about it in the S3Proxy documentation. Can anyone please help me with some details on how to configure S3Proxy to use Azure Data Lake Storage Gen2?
s3proxy.properties file:
s3proxy.endpoint=http://0.0.0.0:8080
s3proxy.authorization=aws-v2-or-v4
s3proxy.identity=local-identity
s3proxy.credential=local-credential
jclouds.provider=azureblob-sdk
jclouds.azureblob.auth=azureKey
jclouds.endpoint=https://testsa.blob.core.windows.net
jclouds.identity=testsa
jclouds.credential=
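For reference, our matching Spark-side s3a configuration is sketched below; the endpoint, access key, and secret key are placeholders that must mirror the s3proxy.properties values above:

import org.apache.spark.sql.SparkSession

// All values are illustrative and must match s3proxy.properties.
val spark = SparkSession.builder()
  .appName("s3proxy-adls-test")
  .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:8080")   // s3proxy.endpoint
  .config("spark.hadoop.fs.s3a.access.key", "local-identity")        // s3proxy.identity
  .config("spark.hadoop.fs.s3a.secret.key", "local-credential")      // s3proxy.credential
  .config("spark.hadoop.fs.s3a.path.style.access", "true")           // S3Proxy serves path-style URLs
  .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")     // the proxy endpoint is plain HTTP
  .getOrCreate()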
-
data.csv is available in the ADLS Gen2 container, and Spark is able to read the data:
val df = spark.read.csv("s3a://data/s3proxy/data.csv")
df.show()
-
But the write operation to ADLS Gen2 with the command below fails with an error:
df.write.format("csv").option("header", "true").save("s3a://data/s3proxy/data_new.csv")
Error:
[s3proxy] W 03-10 11:42:48.599 S3Proxy-Jetty-54 o.g.s.o.e.j.server.HttpChannel:793 handleException /data/s3proxy/data_new.csv/_temporary/0/ java.io.IOException: com.azure.storage.blob.models.BlobStorageException: Status code 400, "RequestId:bb61fc5e-501e-005c-4cb1-91bbca000000 <Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server. Time: 2025-03-10T11:42:48.60290012</Message></Error>"
Regards,
Ali
Azure Data Lake Storage Gen2 uses a different API than Azure Blob Storage: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-java?tabs=azure-ad
S3Proxy does not support this. Adding this would be straightforward if you use the azureblob-sdk as a template: https://github.com/gaul/s3proxy/tree/master/src/main/java/org/gaul/s3proxy/azureblob
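To illustrate the API difference, here is a minimal sketch (in Scala, matching the Spark snippets above) of what a hypothetical ADLS Gen2 backend would call via the azure-storage-file-datalake SDK. The account name, key, filesystem, and paths are placeholders, and note the dfs endpoint rather than the blob endpoint used by azureblob-sdk:

import com.azure.storage.common.StorageSharedKeyCredential
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder

// Placeholder account name and key; ADLS Gen2 uses the dfs endpoint, not blob.
val credential = new StorageSharedKeyCredential("testsa", "<account-key>")
val service = new DataLakeServiceClientBuilder()
  .endpoint("https://testsa.dfs.core.windows.net")
  .credential(credential)
  .buildClient()

// Unlike flat blob containers, ADLS Gen2 filesystems have real directories.
val fileSystem = service.getFileSystemClient("data")
val directory = fileSystem.getDirectoryClient("s3proxy")
val file = directory.createFile("data_new.csv")   // creates an empty file
file.uploadFromFile("/tmp/data_new.csv", true)    // upload local content, overwrite = true

The hierarchical directory model is likely also why the blob-style request against the Spark _temporary/ staging path above fails with InvalidUri on a Data Lake account.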