
DataLakeFileClient::Download/DataLakePathClient::GetProperties should not expect response headers only available in Blob Service API

Open bnmaa opened this issue 2 years ago • 7 comments

Describe the bug Our application uses DataLakeFileClient::Download() and DataLakePathClient::GetProperties() to fetch file content / properties from ADLS Gen2 storage, but both calls fail because the implementation expects response headers (x-ms-blob-type and x-ms-creation-time) that are only returned by the Blob service API and are not available in the DataLake Filesystem API.
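For illustration only (this is not the SDK's code), the failure mode is plain std::map::at behavior: reading a required header that a DFS response never carries throws std::out_of_range, which is the map::at frame in the stack traces below.

	#include <iostream>
	#include <map>
	#include <stdexcept>
	#include <string>

	int main()
	{
	  // A dfs response carries generic headers, but not the blob-only keys.
	  std::map<std::string, std::string> headers = {{"Content-Length", "42"}};

	  try
	  {
	    // Mirrors how the generated blob client reads required headers.
	    const std::string blobType = headers.at("x-ms-blob-type");
	    std::cout << blobType << "\n";
	  }
	  catch (const std::out_of_range&)
	  {
	    std::cout << "missing header: x-ms-blob-type\n"; // the reported failure
	  }
	  return 0;
	}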

Exception or Stack Trace For DataLakePathClient::GetProperties()

std::map<std::string,std::string,Azure::Core::_internal::StringExtensions::CaseInsensitiveComparator,std::allocator<std::pair<std::string const ,std::string>>>::at(const std::string & _Keyval), Line 341
Azure::Storage::Blobs::_detail::BlobClient::GetProperties(Azure::Core::Http::_internal::HttpPipeline & pipeline, const Azure::Core::Url & url, const Azure::Storage::Blobs::_detail::BlobClient::GetBlobPropertiesOptions & options, const Azure::Core::Context & context), Line 3940
Azure::Storage::Blobs::BlobClient::GetProperties(const Azure::Storage::Blobs::GetBlobPropertiesOptions & options, const Azure::Core::Context & context), Line 533
Azure::Storage::Files::DataLake::DataLakePathClient::GetProperties(const Azure::Storage::Files::DataLake::GetPathPropertiesOptions & options, const Azure::Core::Context & context), Line 276

For DataLakeFileClient::Download()

std::map<std::string,std::string,Azure::Core::_internal::StringExtensions::CaseInsensitiveComparator,std::allocator<std::pair<std::string const ,std::string>>>::at(const std::string & _Keyval), Line 341
Azure::Storage::Blobs::_detail::BlobClient::Download(Azure::Core::Http::_internal::HttpPipeline & pipeline, const Azure::Core::Url & url, const Azure::Storage::Blobs::_detail::BlobClient::DownloadBlobOptions & options, const Azure::Core::Context & context), Line 3709
Azure::Storage::Blobs::BlobClient::Download(const Azure::Storage::Blobs::DownloadBlobOptions & options, const Azure::Core::Context & context), Line 198
Azure::Storage::Blobs::BlobClient::DownloadTo(unsigned char * buffer, unsigned __int64 bufferSize, const Azure::Storage::Blobs::DownloadBlobToOptions & options, const Azure::Core::Context & context), Line 315
Azure::Storage::Files::DataLake::DataLakeFileClient::DownloadTo(unsigned char * buffer, unsigned __int64 bufferSize, const Azure::Storage::Blobs::DownloadBlobToOptions & options, const Azure::Core::Context & context), Line 231

To Reproduce Call DataLakeFileClient::Download() or DataLakePathClient::GetProperties() to fetch file content / properties from ADLS Gen2 storage.

Code Snippet

	int64_t GetSize() final
	{
		// GetProperties issues a blob GetProperties under the hood, which is
		// where the missing-header exception is thrown.
		const auto props = m_spDataLakeFileClient->GetProperties(AzureSDK::DataLake::GetPathPropertiesOptions{}, GetAzureSDKContextWithTimeout());
		return props.Value.FileSize;
	}
	void ReadRange(const int64_t in_uOffset, const int64_t in_uSize, uint8_t* out_pBuffer)
	{
		const AzureSDK::Http::HttpRange range{ /* .Offset = */ in_uOffset, /* .Length = */ in_uSize };
		const AzureSDK::DataLake::DownloadFileToOptions options{ /* .Range = */ range };

		// DownloadTo issues a blob Download under the hood and fails the same way.
		const auto response = m_spDataLakeFileClient->DownloadTo(out_pBuffer, in_uSize, options, GetAzureSDKContextWithTimeout());
		const auto contentLength = response.Value.ContentRange.Length;
		...
	}

Expected behavior DataLakeFileClient::Download() and DataLakePathClient::GetProperties() should work. Currently we always get an exception from the std::map::at call shown in the stack traces above, because the blob-only headers are missing from the response.

Screenshots N/A

Setup (please complete the following information):

  • OS: Windows 11 22H2
  • IDE: Visual Studio 2019
  • Version of the Library used: azure-storage-blobs_12.6.0-beta.2

Additional context The issue still persists with the latest azure-sdk-for-cpp. Code snippets encountering the exception:
https://github.com/Azure/azure-sdk-for-cpp/blob/ead6ac3de1f7ae8e09d320b563555760b586349b/sdk/storage/azure-storage-blobs/src/rest_client.cpp#L3732
https://github.com/Azure/azure-sdk-for-cpp/blob/ead6ac3de1f7ae8e09d320b563555760b586349b/sdk/storage/azure-storage-blobs/src/rest_client.cpp#L3841
https://github.com/Azure/azure-sdk-for-cpp/blob/ead6ac3de1f7ae8e09d320b563555760b586349b/sdk/storage/azure-storage-blobs/src/rest_client.cpp#L3964
https://github.com/Azure/azure-sdk-for-cpp/blob/ead6ac3de1f7ae8e09d320b563555760b586349b/sdk/storage/azure-storage-blobs/src/rest_client.cpp#L3977
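One possible shape of the decoupling requested later in this thread, sketched with a hypothetical helper (the SDK's real header map is case-insensitive; a plain std::map stands in for brevity): read the blob-only headers optionally instead of with a throwing .at().

	#include <map>
	#include <string>

	#include <azure/core/nullable.hpp>

	// Hypothetical helper, not part of the SDK: look up a header without
	// throwing, returning an empty Nullable when the key is absent.
	Azure::Nullable<std::string> TryGetHeader(
	    const std::map<std::string, std::string>& headers,
	    const std::string& name)
	{
	  const auto it = headers.find(name);
	  return it == headers.end() ? Azure::Nullable<std::string>{}
	                             : Azure::Nullable<std::string>{it->second};
	}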

Information Checklist Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • [x] Bug Description Added
  • [x] Repro Steps Added
  • [x] Setup information Added

bnmaa avatar Oct 14 '22 08:10 bnmaa

@bnmaa what kind of storage server are you running against? Are you using Azure Storage service?

Jinming-Hu avatar Oct 17 '22 04:10 Jinming-Hu

@bnmaa what kind of storage server are you running against? Are you using Azure Storage service?

I know what's going on now.

  1. The SDK code internally replaces .dfs. with .blob. when constructing the BlobClient member of DataLakePathClient (by calling Azure::Storage::Files::DataLake::_detail::GetBlobUrlFromUrl).
  2. We're actually sending requests to a proxy server, which forwards them to the real Azure storage account (StorageV2) and proxies the response back to our application.

That's why the "replace the URL => expect headers from the Azure Blob service" approach doesn't work for us.
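Conceptually (a simplified sketch, not the SDK's actual GetBlobUrlFromUrl implementation), the substitution only has an effect when the host actually contains .dfs.:

	#include <string>

	// Simplified sketch of the ".dfs." -> ".blob." host substitution.
	// A proxy hostname without ".dfs." passes through unchanged, so the
	// internal BlobClient keeps hitting the dfs-facing surface and the
	// blob-only response headers are never present.
	std::string GetBlobUrlFromUrlSketch(std::string url)
	{
	  const std::string dfs = ".dfs.";
	  const auto pos = url.find(dfs);
	  if (pos != std::string::npos)
	  {
	    url.replace(pos, dfs.size(), ".blob.");
	  }
	  return url;
	}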

I can understand how the SDK currently reuses BlobClient as part of DataLakePathClient's implementation detail, but I feel it's not ideal to assume that every client sends requests directly to the Azure Storage service's public endpoint.

@Jinming-Hu Could you and the dev team evaluate whether this is a feasible feature request: fully decoupling DataLake*Client from those blob-specific headers?

bnmaa avatar Oct 17 '22 10:10 bnmaa

@bnmaa It's a common scenario for people to use Azure Storage through a proxy, CDN, etc. I agree our SDK should be able to handle this case.

Are you able to work around this issue until we can fix it?

Jinming-Hu avatar Oct 17 '22 14:10 Jinming-Hu

@Jinming-Hu Yes, we have a workaround-like solution; it's just that having this supported in the SDK would be the ideal one. Thanks in advance!

bnmaa avatar Oct 18 '22 00:10 bnmaa

The Data Lake service has two endpoints, a .dfs. one and a .blob. one. How would your proxy forward requests to these endpoints? Does it always forward requests to the .dfs. endpoint?

Jinming-Hu avatar Oct 18 '22 08:10 Jinming-Hu

In order to forward the request correctly, the proxy server can differentiate the desired API set by hostname (or even just by port number). For example, https://myproxy.datalakesurface.net:5443/* is forwarded to https://adlsgen2account.dfs.core.windows.net/*, while https://myproxy.blobservicesurface.net:6443/* is forwarded to https://adlsgen2account.blob.core.windows.net/*.

So we expect:

  1. feeding https://myproxy.datalakesurface.net:5443/* (forwarded to .dfs.) to Azure::Storage::Files::DataLake::DataLake*Client will work
  2. feeding https://myproxy.blobservicesurface.net:6443/* (forwarded to .blob.) to Azure::Storage::Blobs::*Blob*Client will also work

But with the current SDK implementation, case 1 fails: the request is forwarded to .dfs., yet the underlying DataLakePathClient::m_blobClient still expects the blob-specific response headers (x-ms-blob-type and x-ms-creation-time).

My aforementioned workaround is actually to stick with BlobClient for now. But ultimately, we want to be able to switch to DataLakeFileClient in some cases.
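For reference, here is a sketch of that workaround under the proxy setup described above (hypothetical URL, anonymous access, error handling omitted): talking to the blob-facing surface directly with BlockBlobClient, so the blob-only headers are always present.

	#include <cstdint>
	#include <vector>

	#include <azure/storage/blobs.hpp>

	void ReadThroughBlobSurface()
	{
	  // The blob-facing proxy surface forwards to the account's .blob.
	  // endpoint, so GetProperties/DownloadTo see the headers they require.
	  Azure::Storage::Blobs::BlockBlobClient client(
	      "https://myproxy.blobservicesurface.net:6443/container/file.txt");

	  const auto props = client.GetProperties();
	  std::vector<uint8_t> buffer(static_cast<std::size_t>(props.Value.BlobSize));
	  client.DownloadTo(buffer.data(), buffer.size());
	}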

bnmaa avatar Oct 18 '22 14:10 bnmaa

Hi @bnmaa, I'd like to confirm one more thing. Is it a reverse proxy you're using? Is the proxy able to see the URL and request headers when deciding where to redirect the request?

Jinming-Hu avatar Oct 21 '22 03:10 Jinming-Hu

@Jinming-Hu

Hi @bnmaa, I'd like to confirm one more thing. Is it a reverse proxy you're using? Is the proxy able to see the URL and request headers when deciding where to redirect the request?

Yes, it is able to see the URL and headers, but we'd like to avoid adding any customized forwarding logic inside the proxy server. That proxy server is owned by a partner team and has other consumers, who probably wouldn't think the customized logic makes sense for them.

bnmaa avatar Nov 01 '22 02:11 bnmaa

I don't think there's much we can do here. DataLake requests are sent to both the .dfs. endpoint and the .blob. endpoint; this is by design at the moment. Over the long term, all requests will switch to the .blob. endpoint.

If you're using a reverse proxy, the endpoint information is already lost before the request is sent from the client side.

I can come up with two workarounds:

  1. The reverse proxy decides where to redirect the request based on the URL or request headers.
  2. Use two reverse-proxy domain names: one contains the keyword .dfs. and forwards requests to the Azure Storage dfs endpoint, the other contains .blob. and forwards requests to the Azure Storage blob endpoint (e.g. foo.bar.dfs.proxy.com and foo.bar.blob.proxy.com), so that the SDK's domain-name replacement logic still works. A sketch follows this list.
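Here is a sketch of workaround 2 (the proxy hostnames are hypothetical): because the dfs-facing name itself contains .dfs., the SDK's internal substitution derives the matching blob-facing name and both halves of DataLakeFileClient keep working.

	#include <azure/storage/files/datalake.hpp>

	void UseDfsKeyedProxyNames()
	{
	  // Forwarded by the proxy to https://adlsgen2account.dfs.core.windows.net/...
	  Azure::Storage::Files::DataLake::DataLakeFileClient client(
	      "https://foo.bar.dfs.proxy.com/filesystem/dir/file.txt");

	  // Internally the SDK rewrites ".dfs." to ".blob.", so its BlobClient
	  // targets https://foo.bar.blob.proxy.com/..., which the proxy forwards
	  // to the account's .blob. endpoint.
	  const auto props = client.GetProperties();
	  static_cast<void>(props);
	}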

Jinming-Hu avatar Nov 01 '22 02:11 Jinming-Hu

Hi @bnmaa. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

ghost avatar Feb 22 '23 08:02 ghost

Hi @bnmaa, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

ghost avatar Mar 01 '23 10:03 ghost