arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-2034: [C++] Filesystem implementation for Azure Blob Storage

Open shefali163 opened this issue 2 years ago • 16 comments

shefali163 avatar Apr 18 '22 17:04 shefali163

https://issues.apache.org/jira/browse/ARROW-2034

github-actions[bot] avatar Apr 18 '22 17:04 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Apr 18 '22 17:04 github-actions[bot]

@github-actions autotune

kou avatar Jun 01 '22 04:06 kou

@github-actions rebase

kou avatar Jun 01 '22 04:06 kou

Hi @kou, Currently ARROW_AZURE is turned OFF for all the PR builds, and azurefs.cc/azurefs_mock.cc is not being included in the target, For which particular builds we should turn this ON?

shefali163 avatar Jun 02 '22 16:06 shefali163

Could you enable it in .github/workflows/cpp.yml (Linux, macOS and Windows) and ci/docker/ubuntu-*-cpp.dockerfile?

kou avatar Jun 03 '22 01:06 kou

Rebase should fix the R build failures, though there are a couple of merge conflicts to resolve

nealrichardson avatar Jun 14 '22 15:06 nealrichardson

We were going to add Azurite for testing AzureFileSystem but while using this we faced some issues(similar to https://github.com/Azure/Azurite/issues/1101), We are currently working on resolving this, MockAzureFileSystem is a temporary way of testing AzureFileSystem as it copies the logic of the APIs which provides some level of testing, but we are planning to add Azurite in the upcoming PRs, Let me know if you think MockAzureFileSystem won't work for this PR. @kou

shefali163 avatar Jun 16 '22 05:06 shefali163

Does Azurite have issues for all operations or only some operations? If it's the latter, how about enabling only tests that use stable operations?

I don't think that duplicating codes for testing with AzureFileSystem and MockAzureFileSystem is a good idea even if it's just a temporary solution... For example, copy problems may not be found by review...

kou avatar Jun 16 '22 06:06 kou

Azurite has issues with the SDK clients and some endpoints like GetProperties etc. which are used in the majority of operations, If MockAzureFileSystem cannot be used then I will check if there are any stable tests and will work on resolving the issues with Azurite.

shefali163 avatar Jun 16 '22 07:06 shefali163

Hi @kou, I have added an install script for azurite but this seems to fail with a permission denied error(log), do we need to add this script somewhere else as well?

Also, the windows build fails with a linker error(log), Are we not supposed to use azurefs_objlib with WIN32? Ref: https://github.com/apache/arrow/blob/master/cpp/cmake_modules/BuildUtils.cmake#L205

shefali163 avatar Jun 25 '22 17:06 shefali163

I have added an install script for azurite but this seems to fail with a permission denied error(log), do we need to add this script somewhere else as well?

chmod +x ci/scripts/install_azurite.sh && git add ci/scripts/install_azurite.sh && git commit will resolve it.

Also, the windows build fails with a linker error(log), Are we not supposed to use azurefs_objlib with WIN32? Ref: https://github.com/apache/arrow/blob/master/cpp/cmake_modules/BuildUtils.cmake#L205

This isn't related to object library. It's caused by building Azure SDK for C++. It's a problem of Azure SDK for C++. https://github.com/Azure/azure-sdk-for-cpp/blob/80cf3d09fdebbf4a427a0a8b89d054250bb48ec9/sdk/storage/azure-storage-common/CMakeLists.txt#L98 should be executed only in if(MSVC) not if(WIN32). Could you report this to Azure SDK for C++? (Note that we should not change https://github.com/Azure/azure-sdk-for-cpp/blob/80cf3d09fdebbf4a427a0a8b89d054250bb48ec9/sdk/storage/azure-storage-common/CMakeLists.txt#L95 to if(MSVC) because there is one more statement: https://github.com/Azure/azure-sdk-for-cpp/blob/80cf3d09fdebbf4a427a0a8b89d054250bb48ec9/sdk/storage/azure-storage-common/CMakeLists.txt#L96 )

kou avatar Jun 25 '22 20:06 kou

Keeping ARROW_AZURE : OFF in windows-mingw build for now

shefali163 avatar Jun 26 '22 21:06 shefali163

Keeping ARROW_AZURE : OFF in windows-mingw build for now

OK. But could you report this problem to upstream to enable on MinGW in the future?

kou avatar Jun 28 '22 05:06 kou

I'd like to give this a review before it gets merged, if possible.

Le 10 août 2022 10:28:06 Sutou Kouhei @.***> a écrit :

@kou commented on this pull request. Could you rebase on master to resolve conflicts? Could you tell me your account on Jira? I want to assign you to https://issues.apache.org/jira/browse/ARROW-2034 before we merge this. Honestly, I can't review all changes carefully in this pull request because this is too large for me. I'll merge this after we address all comments from me. Problems I missed in this pull request or newly found problems after we merge can be fixed by follow-up pull requests.

In cpp/thirdparty/versions.txt:

+ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.4.0 +ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715 ⬇️ Suggested change -ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.4.0 -ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715 +ARROW_AZURE_STORAGE_BLOBS_BUILD_VERSION=12.4.0 +ARROW_AZURE_STORAGE_BLOBS_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715

In cpp/src/arrow/filesystem/azurefs.cc:

  • if (this->is_azurite) {
  • account_blob_url = "http://127.0.0.1:10000/" + account_name + "/";
  • account_dfs_url = "http://127.0.0.1:10000/" + account_name + "/";
  • } else {
  • account_dfs_url = "https://" + account_name + ".dfs.core.windows.net/";
  • account_blob_url = "https://" + account_name + ".blob.core.windows.net/";
  • }
  • storage_credentials_provider =
  • std::make_sharedAzure::Storage::StorageSharedKeyCredential(account_name,
  • account_key);
  • credentials_kind = AzureCredentialsKind::StorageCredentials;
  • return Status::OK(); +}

+Status AzureOptions::ConfigureConnectionStringCredentials(

  • const std::string& connection_string_uri) { It seems that connection string isn't an URI: https://docs.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string

In cpp/src/arrow/filesystem/azurefs.cc:

  • // Ensure container exists
  • ARROW_ASSIGN_OR_RAISE(bool container_exists, impl_->ContainerExists(path.container));
  • if (!container_exists) {
  • RETURN_NOT_OK(impl_->CreateContainer(path.container));
  • }
  • std::vectorstd::string parent_path_to_file;
  • for (const auto& part : path.path_to_file_parts) {
  • parent_path_to_file.push_back(part);
  • RETURN_NOT_OK(impl_->CreateEmptyDir(path.container, parent_path_to_file));
  • }
  • return Status::OK();
  • } else {
  • // Check parent dir exists
  • if (path.has_parent()) {
  • AzurePath parent_path = path.parent(); ⬇️ Suggested change
  • AzurePath parent_path = path.parent();
  • auto parent_path = path.parent();

In cpp/thirdparty/versions.txt:

@@ -33,6 +33,16 @@ ARROW_AWS_C_COMMON_BUILD_VERSION=v0.6.9 ARROW_AWS_C_COMMON_BUILD_SHA256_CHECKSUM=928a3e36f24d1ee46f9eec360ec5cebfe8b9b8994fe39d4fa74ff51aebb12717 ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION=v0.1.5 ARROW_AWS_C_EVENT_STREAM_BUILD_SHA256_CHECKSUM=f1b423a487b5d6dca118bfc0d0c6cc596dc476b282258a3228e73a8f730422d4 +ARROW_AZURE_CORE_BUILD_VERSION=1.5.0 +ARROW_AZURE_CORE_BUILD_SHA256_CHECKSUM=dab2caa54d062b61dbe982e29a4f1fcc70216b51b038a807763712a40dd258e9 +ARROW_AZURE_IDENTITY_BUILD_VERSION=1.2.0 +ARROW_AZURE_IDENTITY_BUILD_SHA256_CHECKSUM=ad4702890c25f956c59a63be4571a08ae0690fa6d2bfbebf326d0fd2e9b72945 +ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.4.0 +ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715 +ARROW_AZURE_STORAGE_COMMON_BUILD_VERSION=12.2.3 +ARROW_AZURE_STORAGE_COMMON_BUILD_SHA256_CHECKSUM=2d58e9c314b1b32f7d09880239a4ecce6686ed6df236a58f681ae5d526ed6201 +ARROW_AZURE_STORAGE_FILES_DATALAKE_BUILD_VERSION=12.3.1 +ARROW_AZURE_STORAGE_FILES_DATALAKE_BUILD_SHA256_CHECKSUM=a5b74076a751d7cfaf7c56674a40ce2792c4fab9add18758fab1fe091d00baff Could you update to the latest versions? I got segmentation fault with these versions on my Debian GNU/Linux sid environment. diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake index 3a555b0c4b..73d582a7a9 100644 --- a/cpp/cmake_modules/ThirdpartyToolchain.cmake +++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake @@ -4841,14 +4841,15 @@ macro(build_azuresdk) set(AZURESDK_COMMON_CMAKE_ARGS ${EP_COMMON_CMAKE_ARGS}

  • "-DCMAKE_INSTALL_PREFIX=${AZURESDK_PREFIX}"
  • "-DCMAKE_PREFIX_PATH=${AZURESDK_PREFIX}" -DBUILD_SHARED_LIBS=OFF -DCMAKE_INSTALL_LIBDIR=${AZURESDK_LIB_DIR}
  • -DDISABLE_AZURE_CORE_OPENTELEMETRY=ON -DENABLE_TESTING=OFF -DENABLE_UNITY_BUILD=ON -DOPENSSL_ROOT_DIR=${OPENSSL_ROOT_HINT}
  • -DWARNINGS_AS_ERRORS=OFF
  • "-DCMAKE_INSTALL_PREFIX=${AZURESDK_PREFIX}"
  • "-DCMAKE_PREFIX_PATH=${AZURESDK_PREFIX}")
  • -DWARNINGS_AS_ERRORS=OFF) file(MAKE_DIRECTORY ${AZURESDK_INCLUDE_DIR}) diff --git a/cpp/thirdparty/versions.txt b/cpp/thirdparty/versions.txt index a8797de94b..1f21385a78 100644 --- a/cpp/thirdparty/versions.txt +++ b/cpp/thirdparty/versions.txt @@ -33,14 +33,14 @@ ARROW_AWS_C_COMMON_BUILD_VERSION=v0.6.9 ARROW_AWS_C_COMMON_BUILD_SHA256_CHECKSUM=928a3e36f24d1ee46f9eec360ec5cebfe8b9b8994fe39d4fa74ff51aebb12717 ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION=v0.1.5 ARROW_AWS_C_EVENT_STREAM_BUILD_SHA256_CHECKSUM=f1b423a487b5d6dca118bfc0d0c6cc596dc476b282258a3228e73a8f730422d4 -ARROW_AZURE_CORE_BUILD_VERSION=1.5.0 -ARROW_AZURE_CORE_BUILD_SHA256_CHECKSUM=dab2caa54d062b61dbe982e29a4f1fcc70216b51b038a807763712a40dd258e9 -ARROW_AZURE_IDENTITY_BUILD_VERSION=1.2.0 -ARROW_AZURE_IDENTITY_BUILD_SHA256_CHECKSUM=ad4702890c25f956c59a63be4571a08ae0690fa6d2bfbebf326d0fd2e9b72945 -ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.4.0 -ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715 -ARROW_AZURE_STORAGE_COMMON_BUILD_VERSION=12.2.3 -ARROW_AZURE_STORAGE_COMMON_BUILD_SHA256_CHECKSUM=2d58e9c314b1b32f7d09880239a4ecce6686ed6df236a58f681ae5d526ed6201 +ARROW_AZURE_CORE_BUILD_VERSION=1.7.1 +ARROW_AZURE_CORE_BUILD_SHA256_CHECKSUM=ae6f03e65d9773d11cf3b9619d0bc7f567272974cf31b9e1c8ca2fa0ea4fb4c6 +ARROW_AZURE_IDENTITY_BUILD_VERSION=1.3.0 +ARROW_AZURE_IDENTITY_BUILD_SHA256_CHECKSUM=46701acd8000f317d1c4b33263d5d3203924fadcfa5af4860ae9187046a72c45 +ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.5.0 +ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=12394d864144ced9fc3562ad48cfe3426604e871b5aa72853ca398e086f0c594 +ARROW_AZURE_STORAGE_COMMON_BUILD_VERSION=12.2.4 +ARROW_AZURE_STORAGE_COMMON_BUILD_SHA256_CHECKSUM=7644b4355b492ba2039236b9fd56c3e7bb80aad983d8bac6a731d74aaf64e03f ARROW_AZURE_STORAGE_FILES_DATALAKE_BUILD_VERSION=12.3.1 ARROW_AZURE_STORAGE_FILES_DATALAKE_BUILD_SHA256_CHECKSUM=a5b74076a751d7cfaf7c56674a40ce2792c4fab9add18758fab1fe091d00baff ARROW_BOOST_BUILD_VERSION=1.75.0 — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

pitrou avatar Aug 10 '22 14:08 pitrou

OK. I'll wait for your review at least a few weeks.

kou avatar Aug 10 '22 20:08 kou

Jira Username - shefali636 If there are no further comments then can we merge this now?

shefali163 avatar Aug 29 '22 04:08 shefali163

@kou I'll definitely review this.

pitrou avatar Aug 29 '22 08:08 pitrou

@shefali163 Could you confirm Travis CI failures?

It seems that we need to install libxml2-dev:

diff --git a/ci/docker/ubuntu-18.04-cpp.dockerfile b/ci/docker/ubuntu-18.04-cpp.dockerfile
index 0e20b7c6a..745a12314 100644
--- a/ci/docker/ubuntu-18.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-18.04-cpp.dockerfile
@@ -79,6 +79,7 @@ RUN apt-get update -y -q && \
         libre2-dev \
         libsnappy-dev \
         libssl-dev \
+        libxml2-dev \
         ninja-build \
         pkg-config \
         protobuf-compiler \
diff --git a/ci/docker/ubuntu-20.04-cpp.dockerfile b/ci/docker/ubuntu-20.04-cpp.dockerfile
index dd36aff84..9e8da8950 100644
--- a/ci/docker/ubuntu-20.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-20.04-cpp.dockerfile
@@ -90,6 +90,7 @@ RUN apt-get update -y -q && \
         libssl-dev \
         libthrift-dev \
         libutf8proc-dev \
+        libxml2-dev \
         libzstd-dev \
         make \
         ninja-build \
diff --git a/ci/docker/ubuntu-22.04-cpp.dockerfile b/ci/docker/ubuntu-22.04-cpp.dockerfile
index 05aca5315..391315177 100644
--- a/ci/docker/ubuntu-22.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-22.04-cpp.dockerfile
@@ -89,6 +89,7 @@ RUN apt-get update -y -q && \
         libsqlite3-dev \
         libthrift-dev \
         libutf8proc-dev \
+        libxml2-dev \
         libzstd-dev \
         make \
         ninja-build \

kou avatar Aug 29 '22 08:08 kou

In any case, I suggest we hold this up until the C++ version is switched to C++17 (see https://issues.apache.org/jira/browse/ARROW-17545).

pitrou avatar Aug 29 '22 13:08 pitrou

I'm out of my depths here, but based on https://issues.apache.org/jira/browse/ARROW-2034?focusedCommentId=17463318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17463318 I thought it was OK to have some parts of the codebase require a newer C++ standard, while the overall codebase required C++11?

Though perhaps Antoine is suggesting that Arrow's close enough to requiring C++17 that the additional complications to the build process aren't worth the hassle? Do you think that https://github.com/apache/arrow/pull/13991/ will block this for a while?

TomAugspurger avatar Aug 29 '22 13:08 TomAugspurger

Though perhaps Antoine is suggesting that Arrow's close enough to requiring C++17 that the additional complications to the build process aren't worth the hassle?

Yes, exactly this :-)

Do you think that #13991 will block this for a while?

Well, ideally #13991 shouldn't take too much time, but of course a lot of CI has to pass before it's merged.

pitrou avatar Aug 29 '22 14:08 pitrou

@pitrou now that C++17 has been merged, can we unblock this?

nealrichardson avatar Sep 19 '22 13:09 nealrichardson

@nealrichardson Given the more pressing tasks for the 10.0.0 Arrow release, and the desire to not further disrupt the existing CI situation with the requirements of new third-party dependencies, I think this will unfortunately have to wait for 11.0.0.

pitrou avatar Oct 05 '22 09:10 pitrou

Hi- I am curious to know if this ADLS support will be included in the next release of the Arrow library in mid-January (version 11)? We need this feature. Thank you! Jerry

av8or1 avatar Jan 06 '23 19:01 av8or1

Hi again, We need this feature. Is there anything I could do to help bring it to the finish line? Would an additional developer resource be of benefit or just cloud the picture? Thanks!

av8or1 avatar Jan 10 '23 17:01 av8or1

We need to rebase on master, fix a conflict, fix CI failures and address reviews.

If we use step-by-step approach for Azure Blob Storage support like Google Cloud Storage support, we can merge this feature more quickly. See https://github.com/apache/arrow/commits/master/cpp/src/arrow/filesystem/gcsfs.cc for Google Cloud Storage support case.

kou avatar Jan 11 '23 08:01 kou

@shefali163 Any plan to update this?

wgtmac avatar Mar 23 '23 02:03 wgtmac

Hi wgtmac-

I have exchanged email with Weston recently regarding this feature. I volunteered to pick up the work and bring it across the finish line. If however you would prefer to do the work yourself, I would defer to you. I'm glad to do the work, do not misunderstand. I would be a new committer to Arrow though. I digress. Hope this helps.

av8or1 avatar Mar 23 '23 07:03 av8or1

If one of you works on this, could you use a step-by-step approach? Large pull request is difficult to review...

kou avatar Mar 23 '23 07:03 kou