arrow
arrow copied to clipboard
ARROW-2034: [C++] Filesystem implementation for Azure Blob Storage
https://issues.apache.org/jira/browse/ARROW-2034
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
@github-actions autotune
@github-actions rebase
Hi @kou, Currently ARROW_AZURE
is turned OFF for all the PR builds, and azurefs.cc/azurefs_mock.cc
is not being included in the target, For which particular builds we should turn this ON?
Could you enable it in .github/workflows/cpp.yml
(Linux, macOS and Windows) and ci/docker/ubuntu-*-cpp.dockerfile
?
Rebase should fix the R build failures, though there are a couple of merge conflicts to resolve
We were going to add Azurite for testing AzureFileSystem
but while using this we faced some issues(similar to https://github.com/Azure/Azurite/issues/1101), We are currently working on resolving this, MockAzureFileSystem
is a temporary way of testing AzureFileSystem
as it copies the logic of the APIs which provides some level of testing, but we are planning to add Azurite in the upcoming PRs, Let me know if you think MockAzureFileSystem
won't work for this PR. @kou
Does Azurite have issues for all operations or only some operations? If it's the latter, how about enabling only tests that use stable operations?
I don't think that duplicating codes for testing with AzureFileSystem
and MockAzureFileSystem
is a good idea even if it's just a temporary solution... For example, copy problems may not be found by review...
Azurite has issues with the SDK clients and some endpoints like GetProperties
etc. which are used in the majority of operations, If MockAzureFileSystem
cannot be used then I will check if there are any stable tests and will work on resolving the issues with Azurite.
Hi @kou, I have added an install script for azurite but this seems to fail with a permission denied error(log), do we need to add this script somewhere else as well?
Also, the windows build fails with a linker error(log), Are we not supposed to use azurefs_objlib
with WIN32
? Ref: https://github.com/apache/arrow/blob/master/cpp/cmake_modules/BuildUtils.cmake#L205
I have added an install script for azurite but this seems to fail with a permission denied error(log), do we need to add this script somewhere else as well?
chmod +x ci/scripts/install_azurite.sh && git add ci/scripts/install_azurite.sh && git commit
will resolve it.
Also, the windows build fails with a linker error(log), Are we not supposed to use
azurefs_objlib
withWIN32
? Ref: https://github.com/apache/arrow/blob/master/cpp/cmake_modules/BuildUtils.cmake#L205
This isn't related to object library. It's caused by building Azure SDK for C++. It's a problem of Azure SDK for C++. https://github.com/Azure/azure-sdk-for-cpp/blob/80cf3d09fdebbf4a427a0a8b89d054250bb48ec9/sdk/storage/azure-storage-common/CMakeLists.txt#L98 should be executed only in if(MSVC)
not if(WIN32)
. Could you report this to Azure SDK for C++?
(Note that we should not change https://github.com/Azure/azure-sdk-for-cpp/blob/80cf3d09fdebbf4a427a0a8b89d054250bb48ec9/sdk/storage/azure-storage-common/CMakeLists.txt#L95 to if(MSVC)
because there is one more statement: https://github.com/Azure/azure-sdk-for-cpp/blob/80cf3d09fdebbf4a427a0a8b89d054250bb48ec9/sdk/storage/azure-storage-common/CMakeLists.txt#L96 )
Keeping ARROW_AZURE
: OFF
in windows-mingw
build for now
Keeping
ARROW_AZURE
:OFF
inwindows-mingw
build for now
OK. But could you report this problem to upstream to enable on MinGW in the future?
I'd like to give this a review before it gets merged, if possible.
Le 10 août 2022 10:28:06 Sutou Kouhei @.***> a écrit :
@kou commented on this pull request. Could you rebase on master to resolve conflicts? Could you tell me your account on Jira? I want to assign you to https://issues.apache.org/jira/browse/ARROW-2034 before we merge this. Honestly, I can't review all changes carefully in this pull request because this is too large for me. I'll merge this after we address all comments from me. Problems I missed in this pull request or newly found problems after we merge can be fixed by follow-up pull requests.
In cpp/thirdparty/versions.txt:
+ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.4.0 +ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715 ⬇️ Suggested change -ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.4.0 -ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715 +ARROW_AZURE_STORAGE_BLOBS_BUILD_VERSION=12.4.0 +ARROW_AZURE_STORAGE_BLOBS_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715
In cpp/src/arrow/filesystem/azurefs.cc:
- if (this->is_azurite) {
- account_blob_url = "http://127.0.0.1:10000/" + account_name + "/";
- account_dfs_url = "http://127.0.0.1:10000/" + account_name + "/";
- } else {
- account_dfs_url = "https://" + account_name + ".dfs.core.windows.net/";
- account_blob_url = "https://" + account_name + ".blob.core.windows.net/";
- }
- storage_credentials_provider =
- std::make_sharedAzure::Storage::StorageSharedKeyCredential(account_name,
- account_key);
- credentials_kind = AzureCredentialsKind::StorageCredentials;
- return Status::OK(); +}
+Status AzureOptions::ConfigureConnectionStringCredentials(
- const std::string& connection_string_uri) { It seems that connection string isn't an URI: https://docs.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string
In cpp/src/arrow/filesystem/azurefs.cc:
- // Ensure container exists
- ARROW_ASSIGN_OR_RAISE(bool container_exists, impl_->ContainerExists(path.container));
- if (!container_exists) {
- RETURN_NOT_OK(impl_->CreateContainer(path.container));
- }
- std::vectorstd::string parent_path_to_file;
- for (const auto& part : path.path_to_file_parts) {
- parent_path_to_file.push_back(part);
- RETURN_NOT_OK(impl_->CreateEmptyDir(path.container, parent_path_to_file));
- }
- return Status::OK();
- } else {
- // Check parent dir exists
- if (path.has_parent()) {
- AzurePath parent_path = path.parent(); ⬇️ Suggested change
- AzurePath parent_path = path.parent();
- auto parent_path = path.parent();
In cpp/thirdparty/versions.txt:
@@ -33,6 +33,16 @@ ARROW_AWS_C_COMMON_BUILD_VERSION=v0.6.9 ARROW_AWS_C_COMMON_BUILD_SHA256_CHECKSUM=928a3e36f24d1ee46f9eec360ec5cebfe8b9b8994fe39d4fa74ff51aebb12717 ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION=v0.1.5 ARROW_AWS_C_EVENT_STREAM_BUILD_SHA256_CHECKSUM=f1b423a487b5d6dca118bfc0d0c6cc596dc476b282258a3228e73a8f730422d4 +ARROW_AZURE_CORE_BUILD_VERSION=1.5.0 +ARROW_AZURE_CORE_BUILD_SHA256_CHECKSUM=dab2caa54d062b61dbe982e29a4f1fcc70216b51b038a807763712a40dd258e9 +ARROW_AZURE_IDENTITY_BUILD_VERSION=1.2.0 +ARROW_AZURE_IDENTITY_BUILD_SHA256_CHECKSUM=ad4702890c25f956c59a63be4571a08ae0690fa6d2bfbebf326d0fd2e9b72945 +ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.4.0 +ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715 +ARROW_AZURE_STORAGE_COMMON_BUILD_VERSION=12.2.3 +ARROW_AZURE_STORAGE_COMMON_BUILD_SHA256_CHECKSUM=2d58e9c314b1b32f7d09880239a4ecce6686ed6df236a58f681ae5d526ed6201 +ARROW_AZURE_STORAGE_FILES_DATALAKE_BUILD_VERSION=12.3.1 +ARROW_AZURE_STORAGE_FILES_DATALAKE_BUILD_SHA256_CHECKSUM=a5b74076a751d7cfaf7c56674a40ce2792c4fab9add18758fab1fe091d00baff Could you update to the latest versions? I got segmentation fault with these versions on my Debian GNU/Linux sid environment. diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake index 3a555b0c4b..73d582a7a9 100644 --- a/cpp/cmake_modules/ThirdpartyToolchain.cmake +++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake @@ -4841,14 +4841,15 @@ macro(build_azuresdk) set(AZURESDK_COMMON_CMAKE_ARGS ${EP_COMMON_CMAKE_ARGS}
- "-DCMAKE_INSTALL_PREFIX=${AZURESDK_PREFIX}"
- "-DCMAKE_PREFIX_PATH=${AZURESDK_PREFIX}" -DBUILD_SHARED_LIBS=OFF -DCMAKE_INSTALL_LIBDIR=${AZURESDK_LIB_DIR}
- -DDISABLE_AZURE_CORE_OPENTELEMETRY=ON -DENABLE_TESTING=OFF -DENABLE_UNITY_BUILD=ON -DOPENSSL_ROOT_DIR=${OPENSSL_ROOT_HINT}
- -DWARNINGS_AS_ERRORS=OFF
- "-DCMAKE_INSTALL_PREFIX=${AZURESDK_PREFIX}"
- "-DCMAKE_PREFIX_PATH=${AZURESDK_PREFIX}")
- -DWARNINGS_AS_ERRORS=OFF) file(MAKE_DIRECTORY ${AZURESDK_INCLUDE_DIR}) diff --git a/cpp/thirdparty/versions.txt b/cpp/thirdparty/versions.txt index a8797de94b..1f21385a78 100644 --- a/cpp/thirdparty/versions.txt +++ b/cpp/thirdparty/versions.txt @@ -33,14 +33,14 @@ ARROW_AWS_C_COMMON_BUILD_VERSION=v0.6.9 ARROW_AWS_C_COMMON_BUILD_SHA256_CHECKSUM=928a3e36f24d1ee46f9eec360ec5cebfe8b9b8994fe39d4fa74ff51aebb12717 ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION=v0.1.5 ARROW_AWS_C_EVENT_STREAM_BUILD_SHA256_CHECKSUM=f1b423a487b5d6dca118bfc0d0c6cc596dc476b282258a3228e73a8f730422d4 -ARROW_AZURE_CORE_BUILD_VERSION=1.5.0 -ARROW_AZURE_CORE_BUILD_SHA256_CHECKSUM=dab2caa54d062b61dbe982e29a4f1fcc70216b51b038a807763712a40dd258e9 -ARROW_AZURE_IDENTITY_BUILD_VERSION=1.2.0 -ARROW_AZURE_IDENTITY_BUILD_SHA256_CHECKSUM=ad4702890c25f956c59a63be4571a08ae0690fa6d2bfbebf326d0fd2e9b72945 -ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.4.0 -ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=ce77055ff5e1b88826a89a29399ffbdcdc77beca1eae61c81f34a3f6e0a20715 -ARROW_AZURE_STORAGE_COMMON_BUILD_VERSION=12.2.3 -ARROW_AZURE_STORAGE_COMMON_BUILD_SHA256_CHECKSUM=2d58e9c314b1b32f7d09880239a4ecce6686ed6df236a58f681ae5d526ed6201 +ARROW_AZURE_CORE_BUILD_VERSION=1.7.1 +ARROW_AZURE_CORE_BUILD_SHA256_CHECKSUM=ae6f03e65d9773d11cf3b9619d0bc7f567272974cf31b9e1c8ca2fa0ea4fb4c6 +ARROW_AZURE_IDENTITY_BUILD_VERSION=1.3.0 +ARROW_AZURE_IDENTITY_BUILD_SHA256_CHECKSUM=46701acd8000f317d1c4b33263d5d3203924fadcfa5af4860ae9187046a72c45 +ARROW_AZURE_STORAGE_BLOB_BUILD_VERSION=12.5.0 +ARROW_AZURE_STORAGE_BLOB_BUILD_SHA256_CHECKSUM=12394d864144ced9fc3562ad48cfe3426604e871b5aa72853ca398e086f0c594 +ARROW_AZURE_STORAGE_COMMON_BUILD_VERSION=12.2.4 +ARROW_AZURE_STORAGE_COMMON_BUILD_SHA256_CHECKSUM=7644b4355b492ba2039236b9fd56c3e7bb80aad983d8bac6a731d74aaf64e03f ARROW_AZURE_STORAGE_FILES_DATALAKE_BUILD_VERSION=12.3.1 ARROW_AZURE_STORAGE_FILES_DATALAKE_BUILD_SHA256_CHECKSUM=a5b74076a751d7cfaf7c56674a40ce2792c4fab9add18758fab1fe091d00baff ARROW_BOOST_BUILD_VERSION=1.75.0 — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>
OK. I'll wait for your review at least a few weeks.
Jira Username - shefali636 If there are no further comments then can we merge this now?
@kou I'll definitely review this.
@shefali163 Could you confirm Travis CI failures?
It seems that we need to install libxml2-dev
:
diff --git a/ci/docker/ubuntu-18.04-cpp.dockerfile b/ci/docker/ubuntu-18.04-cpp.dockerfile
index 0e20b7c6a..745a12314 100644
--- a/ci/docker/ubuntu-18.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-18.04-cpp.dockerfile
@@ -79,6 +79,7 @@ RUN apt-get update -y -q && \
libre2-dev \
libsnappy-dev \
libssl-dev \
+ libxml2-dev \
ninja-build \
pkg-config \
protobuf-compiler \
diff --git a/ci/docker/ubuntu-20.04-cpp.dockerfile b/ci/docker/ubuntu-20.04-cpp.dockerfile
index dd36aff84..9e8da8950 100644
--- a/ci/docker/ubuntu-20.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-20.04-cpp.dockerfile
@@ -90,6 +90,7 @@ RUN apt-get update -y -q && \
libssl-dev \
libthrift-dev \
libutf8proc-dev \
+ libxml2-dev \
libzstd-dev \
make \
ninja-build \
diff --git a/ci/docker/ubuntu-22.04-cpp.dockerfile b/ci/docker/ubuntu-22.04-cpp.dockerfile
index 05aca5315..391315177 100644
--- a/ci/docker/ubuntu-22.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-22.04-cpp.dockerfile
@@ -89,6 +89,7 @@ RUN apt-get update -y -q && \
libsqlite3-dev \
libthrift-dev \
libutf8proc-dev \
+ libxml2-dev \
libzstd-dev \
make \
ninja-build \
In any case, I suggest we hold this up until the C++ version is switched to C++17 (see https://issues.apache.org/jira/browse/ARROW-17545).
I'm out of my depths here, but based on https://issues.apache.org/jira/browse/ARROW-2034?focusedCommentId=17463318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17463318 I thought it was OK to have some parts of the codebase require a newer C++ standard, while the overall codebase required C++11?
Though perhaps Antoine is suggesting that Arrow's close enough to requiring C++17 that the additional complications to the build process aren't worth the hassle? Do you think that https://github.com/apache/arrow/pull/13991/ will block this for a while?
Though perhaps Antoine is suggesting that Arrow's close enough to requiring C++17 that the additional complications to the build process aren't worth the hassle?
Yes, exactly this :-)
Do you think that #13991 will block this for a while?
Well, ideally #13991 shouldn't take too much time, but of course a lot of CI has to pass before it's merged.
@pitrou now that C++17 has been merged, can we unblock this?
@nealrichardson Given the more pressing tasks for the 10.0.0 Arrow release, and the desire to not further disrupt the existing CI situation with the requirements of new third-party dependencies, I think this will unfortunately have to wait for 11.0.0.
Hi- I am curious to know if this ADLS support will be included in the next release of the Arrow library in mid-January (version 11)? We need this feature. Thank you! Jerry
Hi again, We need this feature. Is there anything I could do to help bring it to the finish line? Would an additional developer resource be of benefit or just cloud the picture? Thanks!
We need to rebase on master, fix a conflict, fix CI failures and address reviews.
If we use step-by-step approach for Azure Blob Storage support like Google Cloud Storage support, we can merge this feature more quickly. See https://github.com/apache/arrow/commits/master/cpp/src/arrow/filesystem/gcsfs.cc for Google Cloud Storage support case.
@shefali163 Any plan to update this?
Hi wgtmac-
I have exchanged email with Weston recently regarding this feature. I volunteered to pick up the work and bring it across the finish line. If however you would prefer to do the work yourself, I would defer to you. I'm glad to do the work, do not misunderstand. I would be a new committer to Arrow though. I digress. Hope this helps.
If one of you works on this, could you use a step-by-step approach? Large pull request is difficult to review...