azure-sdk-for-cpp
DataLakeFileClient::Download/DataLakePathClient::GetProperties should not expect response headers only available in Blob Service API
Describe the bug
Our application tries to use DataLakeFileClient::Download() and DataLakePathClient::GetProperties() to fetch file content / properties from ADLS Gen2 storage; however, this does not work because the implementation expects response headers that are not available in the DataLake Filesystem API (x-ms-blob-type and x-ms-creation-time).
Exception or Stack Trace
For DataLakePathClient::GetProperties()
std::map<std::string,std::string,Azure::Core::_internal::StringExtensions::CaseInsensitiveComparator,std::allocator<std::pair<std::string const ,std::string>>>::at(const std::string & _Keyval), Line 341
Azure::Storage::Blobs::_detail::BlobClient::GetProperties(Azure::Core::Http::_internal::HttpPipeline & pipeline, const Azure::Core::Url & url, const Azure::Storage::Blobs::_detail::BlobClient::GetBlobPropertiesOptions & options, const Azure::Core::Context & context), Line 3940
Azure::Storage::Blobs::BlobClient::GetProperties(const Azure::Storage::Blobs::GetBlobPropertiesOptions & options, const Azure::Core::Context & context), Line 533
Azure::Storage::Files::DataLake::DataLakePathClient::GetProperties(const Azure::Storage::Files::DataLake::GetPathPropertiesOptions & options, const Azure::Core::Context & context), Line 276
For DataLakeFileClient::Download()
std::map<std::string,std::string,Azure::Core::_internal::StringExtensions::CaseInsensitiveComparator,std::allocator<std::pair<std::string const ,std::string>>>::at(const std::string & _Keyval), Line 341
Azure::Storage::Blobs::_detail::BlobClient::Download(Azure::Core::Http::_internal::HttpPipeline & pipeline, const Azure::Core::Url & url, const Azure::Storage::Blobs::_detail::BlobClient::DownloadBlobOptions & options, const Azure::Core::Context & context), Line 3709
Azure::Storage::Blobs::BlobClient::Download(const Azure::Storage::Blobs::DownloadBlobOptions & options, const Azure::Core::Context & context), Line 198
Azure::Storage::Blobs::BlobClient::DownloadTo(unsigned char * buffer, unsigned __int64 bufferSize, const Azure::Storage::Blobs::DownloadBlobToOptions & options, const Azure::Core::Context & context), Line 315
Azure::Storage::Files::DataLake::DataLakeFileClient::DownloadTo(unsigned char * buffer, unsigned __int64 bufferSize, const Azure::Storage::Blobs::DownloadBlobToOptions & options, const Azure::Core::Context & context), Line 231
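For context: both stack traces end in std::map::at() on a case-insensitive header map, and std::map::at() throws std::out_of_range when the key is missing. A minimal illustration of the failure mode (not the SDK's actual code; the header names are the ones cited in the description above):

#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

int main()
{
    // Headers as returned by a DataLake (dfs) endpoint; the blob-only
    // headers x-ms-blob-type and x-ms-creation-time are absent.
    std::map<std::string, std::string> responseHeaders = {
        {"Content-Length", "1024"},
        {"x-ms-request-id", "<id>"},
    };

    try
    {
        // Mirrors a required-header lookup like the ones in rest_client.cpp.
        const auto blobType = responseHeaders.at("x-ms-blob-type");
        (void)blobType;
    }
    catch (const std::out_of_range&)
    {
        std::cout << "missing header -> std::out_of_range, as in the stack traces above\n";
    }
    return 0;
}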
To Reproduce
Call DataLakeFileClient::Download() and DataLakePathClient::GetProperties() to fetch file content / properties from ADLS Gen2 storage.
Code Snippet
int64_t GetSize() final
{
    // Throws before returning: the SDK's header lookup fails on the response.
    const auto props = m_spDataLakeFileClient->GetProperties(AzureSDK::DataLake::GetPathPropertiesOptions{}, GetAzureSDKContextWithTimeout());
    return props.Value.FileSize;
}

void ReadRange(const int64_t in_uOffset, const int64_t in_uSize, uint8_t* out_pBuffer)
{
    const AzureSDK::Http::HttpRange range{ /* .Offset = */ in_uOffset, /* .Length = */ in_uSize };
    const AzureSDK::DataLake::DownloadFileToOptions options{ /* .Range = */ range };
    // Throws for the same reason; see the stack traces above.
    const auto response = m_spDataLakeFileClient->DownloadTo(out_pBuffer, in_uSize, options, GetAzureSDKContextWithTimeout());
    const auto contentLength = response.Value.ContentRange.Length;
    ...
}
Expected behavior
DataLakeFileClient::Download() and DataLakePathClient::GetProperties() should work. Currently we always get an exception from the std::map::at() call in the underlying BlobClient implementation (see the stack traces above).
Screenshots
N/A
Setup (please complete the following information):
- OS: Windows 11 22H2
- IDE: Visual Studio 2019
- Version of the Library used: azure-storage-blobs_12.6.0-beta.2
Additional context
The issue still persists with the latest azure-sdk-for-cpp. Code locations throwing the exception:
https://github.com/Azure/azure-sdk-for-cpp/blob/ead6ac3de1f7ae8e09d320b563555760b586349b/sdk/storage/azure-storage-blobs/src/rest_client.cpp#L3732
https://github.com/Azure/azure-sdk-for-cpp/blob/ead6ac3de1f7ae8e09d320b563555760b586349b/sdk/storage/azure-storage-blobs/src/rest_client.cpp#L3841
https://github.com/Azure/azure-sdk-for-cpp/blob/ead6ac3de1f7ae8e09d320b563555760b586349b/sdk/storage/azure-storage-blobs/src/rest_client.cpp#L3964
https://github.com/Azure/azure-sdk-for-cpp/blob/ead6ac3de1f7ae8e09d320b563555760b586349b/sdk/storage/azure-storage-blobs/src/rest_client.cpp#L3977
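One possible fix direction, sketched under the assumption that these headers can simply be treated as optional (this is not the SDK's actual code; TryGetCreationTime is a hypothetical helper):

#include <map>
#include <string>

#include <azure/core/datetime.hpp>
#include <azure/core/nullable.hpp>

// Hypothetical helper: parse x-ms-creation-time only when present, so a
// response from a dfs endpoint (or a proxy) that omits the header does
// not throw from std::map::at().
Azure::Nullable<Azure::DateTime> TryGetCreationTime(
    const std::map<std::string, std::string>& headers)
{
    const auto it = headers.find("x-ms-creation-time");
    if (it == headers.end())
    {
        return {}; // header absent: leave the property unset
    }
    return Azure::DateTime::Parse(it->second, Azure::DateTime::DateFormat::Rfc1123);
}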
Information Checklist
Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.
- [x] Bug Description Added
- [x] Repro Steps Added
- [x] Setup information Added
@bnmaa what kind of storage server are you running against? Are you using Azure Storage service?
I know what's going on now:
- The SDK code internally replaces .dfs. with .blob. when constructing the BlobClient member of DataLakePathClient (by calling Azure::Storage::Files::DataLake::_detail::GetBlobUrlFromUrl).
- We're actually sending requests against a proxy server, which forwards the request to a real Azure storage account (StorageV2) and forwards the response back to our application.
That's why the "replace the URL => expect headers from the Azure Blob service" approach didn't work for us.
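To illustrate the substitution involved (a rough sketch assuming a simple first-occurrence string replacement; the real GetBlobUrlFromUrl may differ):

#include <string>

// Sketch of the .dfs. -> .blob. host rewrite performed when building the
// internal BlobClient (assumption: first-occurrence replacement).
std::string BlobUrlFromDfsUrl(std::string url)
{
    const std::string from = ".dfs.";
    const std::string to = ".blob.";
    const auto pos = url.find(from);
    if (pos != std::string::npos)
    {
        url.replace(pos, from.size(), to);
    }
    return url;
}

// "https://account.dfs.core.windows.net/fs/file" becomes
// "https://account.blob.core.windows.net/fs/file", but a proxy URL with
// no ".dfs." in its hostname is left unchanged -- while the client still
// expects blob-style response headers in the reply.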
I can understand how the SDK currently reuses BlobClient as an implementation detail of DataLakePathClient. But I feel it's not ideal to assume that every client sends requests directly against the Azure Storage service's public endpoint.
@Jinming-Hu Could you and the dev team evaluate whether this is a feasible feature request: fully decoupling DataLake*Client from those blob-specific headers?
@bnmaa It's a common scenario that people use Azure Storage through a proxy, CDN, etc. I agree our SDK should be able to handle this case.
Are you able to get around this issue before we can fix it?
@Jinming-Hu Yes, we have another workaround-like solution. It's just that having this supported in the SDK would be the most ideal option. Thanks in advance!
The Datalake service has two endpoints, a .dfs. one and a .blob. one. How would your proxy forward requests to these endpoints? Does it always forward requests to the .dfs. endpoint?
In order to forward the request correctly, the proxy server can differentiate the desired API set using the hostname (or even just the port number).
For example:
https://myproxy.datalakesurface.net:5443/* will be forwarded to https://adlsgen2account.dfs.core.windows.net/*
vs.
https://myproxy.blobservicesurface.net:6443/* will be forwarded to https://adlsgen2account.blob.core.windows.net/*
So we expect (see the sketch below):
1. feeding https://myproxy.datalakesurface.net:5443/* (forwarded to .dfs.) to Azure::Storage::Files::DataLake::DataLake*Client will work
2. feeding https://myproxy.blobservicesurface.net:6443/* (forwarded to .blob.) to Azure::Storage::Blobs::*Blob*Client will also work
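In code, those two expectations look roughly like this (a sketch; the proxy URLs are the ones above, the path components are made up, and credential setup is elided):

#include <memory>

#include <azure/storage/blobs.hpp>
#include <azure/storage/files/datalake.hpp>

void Sketch(std::shared_ptr<Azure::Storage::StorageSharedKeyCredential> credential)
{
    // Expectation 1: a DataLake client pointed at the proxy domain that
    // forwards to the .dfs. endpoint.
    Azure::Storage::Files::DataLake::DataLakeFileClient fileClient(
        "https://myproxy.datalakesurface.net:5443/myfilesystem/dir/file",
        credential);

    // Expectation 2: a Blob client pointed at the proxy domain that
    // forwards to the .blob. endpoint.
    Azure::Storage::Blobs::BlobClient blobClient(
        "https://myproxy.blobservicesurface.net:6443/mycontainer/myblob",
        credential);

    (void)fileClient;
    (void)blobClient;
}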
But with the current SDK implementation, 1 will fail, because the request is forwarded to .dfs., yet the underlying DataLakePathClient::m_blobClient still expects blob-specific response headers (x-ms-blob-type and x-ms-creation-time).
My aforementioned workaround is actually to stick with BlobClient for now. But ultimately we want to be able to switch to DataLakeFileClient in some cases.
Hi @bnmaa, I'd like to confirm one more thing. Is it a reverse proxy you're using? Is the proxy able to see the URL and request headers when deciding where to redirect the request?
@Jinming-Hu
Hi @bnmaa, I'd like to confirm one more thing. Is it a reverse proxy you're using? Is the proxy able to see the URL and request headers when deciding where to redirect the request?
Yes, it is able to see the URL and headers, but we'd like to avoid any customized logic inside the proxy server to work around the forwarding decision, because that proxy server is owned by a partner team, and there are other proxy consumers who probably wouldn't think the customized logic makes sense for them.
I don't think there's much we can do here. DataLake requests are sent to both the .dfs. endpoint and the .blob. endpoint. This is by design at the moment. Over the long term, all requests will switch to the .blob. endpoint.
If you're using a reverse proxy, the endpoint information is already lost before the request is sent from the client side.
I can come up with two workarounds:
- The reverse proxy decides where to redirect the request based on the URL or request headers.
- Use two reverse proxy domain names: one includes the keyword .dfs. and forwards requests to the Azure Storage dfs endpoint, the other includes .blob. and forwards requests to the Azure Storage blob endpoint, like foo.bar.dfs.proxy.com and foo.bar.blob.proxy.com, so that the domain name replacement logic still works.
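A quick check that the second workaround survives the SDK's substitution (same first-occurrence replacement as sketched earlier; the domain names are the hypothetical ones above):

#include <cassert>
#include <string>

int main()
{
    // The keyword-bearing proxy hostname maps onto its blob counterpart,
    // so the internal BlobClient targets the blob-forwarding proxy.
    std::string url = "https://foo.bar.dfs.proxy.com/myfilesystem/dir/file";
    const std::string from = ".dfs.";
    const std::string to = ".blob.";
    url.replace(url.find(from), from.size(), to);
    assert(url == "https://foo.bar.blob.proxy.com/myfilesystem/dir/file");
    return 0;
}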
Hi @bnmaa. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.
Hi @bnmaa, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.