Support jvm version libhdfs in velox
Currently, Gluten throws HDFS connection failures when executing queries if the user's HDFS deployment uses Kerberos authentication or ViewFs. This is because the existing libhdfs3 API does not support Kerberos authentication, whereas the JVM version of libhdfs can invoke APIs that do. If the user's system has the HADOOP_HOME environment variable set, the JVM version of libhdfs is used when compiling Gluten; if not, the default libhdfs3 is used instead.
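For illustration, here is a minimal sketch (not code from this PR) of connecting to a Kerberized HDFS through the JVM libhdfs C API. The hdfsBuilder calls are standard hdfs.h entry points shipped with Hadoop; the namenode host, port, and ticket-cache path are made-up placeholders.

```cpp
// Sketch: Kerberos-authenticated connection via the JVM libhdfs (hdfs.h from
// $HADOOP_HOME/include). Assumes a valid ticket was obtained beforehand (kinit).
#include <hdfs.h>
#include <cstdio>

int main() {
  hdfsBuilder* builder = hdfsNewBuilder();
  hdfsBuilderSetNameNode(builder, "namenode.example.com"); // placeholder host
  hdfsBuilderSetNameNodePort(builder, 8020);               // placeholder port
  // The JVM libhdfs reads the Kerberos credential from this ticket cache;
  // per the PR description, this is the capability missing from libhdfs3.
  hdfsBuilderSetKerbTicketCachePath(builder, "/tmp/krb5cc_1000");
  hdfsFS fs = hdfsBuilderConnect(builder);
  if (fs == nullptr) {
    std::perror("hdfsBuilderConnect");
    return 1;
  }
  hdfsDisconnect(fs);
  return 0;
}
```

Because the JVM libhdfs delegates to org.apache.hadoop.fs.FileSystem, the same entry points should also cover ViewFs URIs.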
@mbasmanova @majetideepak Can you help to review? Thanks.
@JkSelf CI is red. Please take a look.
The C++ libhdfs3 is unmaintained (no commits in 3+ years, and hawq is moving into the Attic (the ASF 'archive')), so I think this should replace it entirely! @pedroerp @kgpai we talked about this before; I think this is what Arrow does as well, but I haven't had a look yet.
cc: @majetideepak
It seems Arrow's HDFS filesystem depends on an hdfs dynamic library being present on the system and dynamically links against it; I see this PR does something similar as well (a sketch of the approach follows below). https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/hdfs_internal.cc#L114
- `get_potential_libhdfs_paths()`
- `get_potential_libjvm_paths()`
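For context, a hedged sketch of that dynamic-loading idea, assuming a conventional Hadoop layout; the candidate paths and validation symbol are illustrative, not Arrow's or this PR's exact logic.

```cpp
// Minimal sketch of runtime discovery of libhdfs, in the spirit of Arrow's
// get_potential_libhdfs_paths(): try $HADOOP_HOME/lib/native first, then the
// bare soname, and resolve one known symbol to validate the library.
#include <dlfcn.h>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

void* loadLibhdfs() {
  std::vector<std::string> candidates;
  if (const char* hadoopHome = std::getenv("HADOOP_HOME")) {
    candidates.push_back(std::string(hadoopHome) + "/lib/native/libhdfs.so");
  }
  candidates.push_back("libhdfs.so"); // fall back to the default search path
  for (const auto& path : candidates) {
    if (void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_GLOBAL)) {
      // Resolve a known entry point to confirm this really is libhdfs.
      if (dlsym(handle, "hdfsBuilderConnect") != nullptr) {
        return handle;
      }
      dlclose(handle);
    }
  }
  throw std::runtime_error("libhdfs.so not found; is HADOOP_HOME set?");
}
```

With this approach neither HDFS nor a JVM needs to be installed at build time; they are only required on machines that actually open hdfs:// paths.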
We dynamically link against libhdfs3 (hawq) as well. It should not hurt to support it? We should avoid the multiple ifdefs in this PR.
> It should not hurt to support it?
Well, it is unmaintained, so it shouldn't be the default, and we should probably mark it deprecated and remove it in the future.
> Well, it is unmaintained, so it shouldn't be the default, and we should probably mark it deprecated and remove it in the future.
Switching the default makes sense since hawq is not maintained. But I wonder if the hawq implementation provides additional benefits (say performance) compared to the Java implementation.
> It seems Arrow's HDFS filesystem depends on an hdfs dynamic library being present on the system and dynamically links against it; I see this PR does something similar as well. https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/hdfs_internal.cc#L114
> `get_potential_libhdfs_paths()` `get_potential_libjvm_paths()`
@majetideepak Yes, that makes sense to me. If we proceed this way, we can avoid the need to ensure that the user's system has HDFS and JVM installed during the build phase.
> Well, it is unmaintained, so it shouldn't be the default, and we should probably mark it deprecated and remove it in the future.

> Switching the default makes sense since hawq is not maintained. But I wonder if the hawq implementation provides additional benefits (say performance) compared to the Java implementation.
- In terms of performance, the hawq implementation is generally considered superior to the jvm libhdfs because it is designed for performance and avoids the extra overhead introduced by JNI. However, functionality is equally important: the hawq implementation currently does not meet user needs in certain areas, such as support for Kerberos authentication and ViewFs.
- If there is a preference to retain the hawq implementation for its performance, we could offer two compilation options, VELOX_ENABLE_HDFS and VELOX_ENABLE_HDFS3, to support the jvm libhdfs and the hawq implementation respectively. However, maintaining both options might require multiple ifdef statements to determine which hdfs class to load (see the sketch below). @majetideepak @assignUser What do you think? Thanks.
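To make the trade-off concrete, here is a hypothetical sketch of the double-ifdef registration that keeping both options would force. Only the option names come from the comment above; the function names are made up for illustration.

```cpp
// Hypothetical sketch (not Velox's actual registration code) of the kind of
// conditional compilation that supporting two HDFS backends would require.
void registerJvmLibhdfs();  // illustrative stand-in: JVM libhdfs adapter
void registerLibhdfs3();    // illustrative stand-in: hawq libhdfs3 adapter

void registerHdfsFileSystem() {
#if defined(VELOX_ENABLE_HDFS)
  // JVM libhdfs: loads libjvm/libhdfs at runtime; supports Kerberos and ViewFs.
  registerJvmLibhdfs();
#elif defined(VELOX_ENABLE_HDFS3)
  // hawq libhdfs3: pure C++ client; no JNI overhead but unmaintained upstream.
  registerLibhdfs3();
#endif
}
```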
> In terms of performance, the hawq implementation is generally considered superior to …

While better performance is of course great, in this case it comes with the considerable drawback that the library in question has been unmaintained for several years, and the parent project is currently in the process of being 'archived' as well. My vote would be to remove it entirely, but I am not a user, so I'll let others decide :)
@assignUser Agree to remove the hawq implementation entirely. @majetideepak What do you think?
> Agree to remove the hawq implementation entirely.
I am not a user either :). Let's open a discussion on this and get input from the wider community.
@assignUser @majetideepak Thank you for your response. I have initiated a poll in the community at https://github.com/facebookincubator/velox/discussions/9969. Your participation and vote would be greatly appreciated.
@JkSelf can you resolve the conflict?
@JkSelf Thank you for your work. Is there any progress? It would be a really nice feature for me to replace with libhdfs so that we can support more file systems based on Hadoop FileSystem.
@wForget Waiting for https://github.com/facebookincubator/velox/pull/10446 to merge first.
Thanks, I have backported this PR to our internal velox repo and it works fine. After relaxing the isHdfsFile check, we can also access custom file systems based on Hadoop FileSystem.
https://github.com/facebookincubator/velox/blob/c7e26370fd6d4daf6121bb169a16df0d3cec576a/velox/connectors/hive/storage_adapters/hdfs/HdfsFileSystem.cpp#L106-L108
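For reference, a hedged sketch of what "relaxing the isHdfsFile check" could look like: the linked code only matches the hdfs:// scheme, and the sketch below broadens the match. The function name and scheme list are illustrative, not the actual patch.

```cpp
// Illustrative sketch, not the actual Velox code: broaden the hdfs://-only
// scheme check so paths like viewfs:// also route to the libhdfs-backed
// filesystem, since the JVM Hadoop FileSystem can serve them as well.
#include <string_view>

bool isHadoopFile(std::string_view filePath) {
  // The strict check accepts only "hdfs://"; a relaxed one could accept any
  // scheme the configured Hadoop FileSystem implementations understand.
  constexpr std::string_view kSchemes[] = {"hdfs://", "viewfs://"};
  for (auto scheme : kSchemes) {
    if (filePath.substr(0, scheme.size()) == scheme) {
      return true;
    }
  }
  return false;
}
```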
@majetideepak @assignUser @xiaoxmeng Can you help to review this PR? Thanks.
Also, I thought the issues with CI should have been fixed; maybe a rebase is needed?
@assignUser The InsertIntoHdfsTest is failing with a core dump in libhdfs.so when executed alongside the velox_hdfs_file test suite.
```
[ RUN ] InsertIntoHdfsTest.insertIntoHdfsTest
I20240911 07:21:56.337330 59114 PeriodicStatsReporter.cpp:69] Starting PeriodicStatsReporter with options allocatorStatsIntervalMs:2000, cacheStatsIntervalMs:2000, arbitratorStatsIntervalMs:2000, spillStatsIntervalMs:2000
I20240911 07:21:56.337757 149694 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I20240911 07:21:56.338176 59114 HiveConnector.cpp:60] Hive connector test-hive created with maximum of 20000 cached file handles.
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
I20240911 07:21:58.337890 149694 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
[... identical Spill memory usage lines repeat every 2s through 07:22:56.341706 ...]
I20240911 07:22:56.401234 181261 FileSink.cpp:130] closing file: hdfs://localhost:7878/test_cursor 1_0_1_5187172a-54af-495c-8d45-a770f5b83638, total size: 14.42KB
I20240911 07:22:56.878257 181261 Task.cpp:1158] All drivers (1) finished for task test_cursor 1 after running for 530ms
I20240911 07:22:56.878322 181261 Task.cpp:1923] Terminating task test_cursor 1 with state Finished after running for 530ms
I20240911 07:22:56.913267 181272 Task.cpp:1158] All drivers (1) finished for task test_cursor 2 after running for 31ms
I20240911 07:22:56.913326 181272 Task.cpp:1923] Terminating task test_cursor 2 with state Finished after running for 31ms
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fb89331c4c8, pid=59114, tid=0x00007fb85bfff640
#
# JRE version: OpenJDK Runtime Environment (8.0_362-b08) (build 1.8.0_362-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.362-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libhdfs.so+0x34c8] methodIdFromClass.part.0+0x88
#
# Core dump written. Default location: /__w/velox/velox/_build/release/velox/connectors/hive/storage_adapters/hdfs/tests/core or core.59114
#
# An error report file with more information is saved as:
# /__w/velox/velox/_build/release/velox/connectors/hive/storage_adapters/hdfs/tests/hs_err_pid59114.log
#
# If you would like to submit a bug report, please visit:
# https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%207&component=java-1.8.0-openjdk
#
```
However, the test passes when run in isolation. It seems there is a potential dependency issue between HdfsFileSystemTest and InsertIntoHdfsTest. Could you provide any insights or suggestions on this matter?
> However, the test passes when run in isolation. It seems there is a potential dependency issue between HdfsFileSystemTest and InsertIntoHdfsTest. Could you provide any insights or suggestions on this matter?
Not really, sorry. I am not familiar with the adapters code or hdfs.
@assignUser @majetideepak @kgpai Can you help to review this PR again? Thanks.
@majetideepak Can you help to review this PR? Thanks.
@JkSelf I will take a look. Please give me another day. Thanks.
@majetideepak Can you help to review again? Thanks.
@majetideepak Resolved all your comments. Can you help to review again? Thanks.
@majetideepak I resolved all your comments. Can you help to review again? Thanks.
@majetideepak Thanks for your review. I have resolved all your comments. Can you help to review again?