GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code
### Rationale for this change
`ObjectType` and `FileStatistics` in `io/hdfs.h` have been deprecated for a while and can be removed.
### What changes are included in this PR?
The `ObjectType` and `FileStatistics` structs are removed and the `FileSystem` API in `arrow::fs` is used instead. Together with this change, the HDFS-related code is moved from `cpp/src/arrow/io` to `cpp/src/arrow/filesystem`, merging the `FileSystem` and `HadoopFileSystem` classes from `arrow::io` into the public `HadoopFileSystem` class.
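As a sketch of what the user-facing change looks like, code that previously consumed `io::ObjectType` / `io::FileStatistics` would now go through `arrow::fs::FileInfo`. This follows the existing `arrow::fs` conventions rather than this PR's actual diff, so names and signatures here should be treated as illustrative:

```cpp
#include <string>

#include <arrow/filesystem/hdfs.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Hypothetical sketch: inspect a file's metadata through the arrow::fs API
// instead of the removed io::ObjectType / io::FileStatistics structs.
arrow::Status InspectFile(const std::string& path) {
  arrow::fs::HdfsOptions options;
  options.ConfigureEndPoint("namenode", 8020);  // placeholder host/port
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::HadoopFileSystem::Make(options));
  ARROW_ASSIGN_OR_RAISE(arrow::fs::FileInfo info, fs->GetFileInfo(path));
  // FileInfo replaces the deprecated structs: type() covers ObjectType,
  // while size() and mtime() cover the FileStatistics fields.
  if (info.type() == arrow::fs::FileType::File) {
    // ... use info.size(), info.mtime(), ...
  }
  return arrow::Status::OK();
}
```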
### Are these changes tested?
Existing tests should pass.
### Are there any user-facing changes?
The deprecated structs are removed and all HDFS-related code is now part of the filesystem module.
- GitHub Issue: #45747
Also closes: https://github.com/apache/arrow/issues/22457 (not sure about io/interfaces.h?)
Ah, this will not work. If we want to remove `ObjectType` and `FileStatistics` from `io/hdfs.h` and instead use the FileSystem API, then the filesystem component would have to be enabled by default. Also, if I understand correctly, we want to do a refactoring of `io/hdfs` anyway, which would include the changes in this PR? (https://github.com/apache/arrow/issues/22457)
cc @pitrou
> Ah, this will not work. If we want to remove `ObjectType` and `FileStatistics` from `io/hdfs.h` and instead use the FileSystem API, then the filesystem component would have to be enabled by default.
Well, `ARROW_HDFS=ON` could imply `ARROW_FILESYSTEM=ON`. I don't think that's a problem.
> Also, if I understand correctly, we want to do a refactoring of io/hdfs anyways which would include the changes in this PR?
Yes, indeed. `io/hdfs` could be moved to `filesystem/hdfs_internal` or something similar.
OK, I will then move `io/hdfs` to `filesystem/hdfs_internal`. Thanks!
@pitrou I think this is ready for review. The failing builds have an issue opened: https://github.com/apache/arrow/issues/46077
Hi @pitrou, could you please take a quick look at the changes when you have a moment?
I've done my best to implement the suggested changes, but I'm sure there's still room for improvement. A couple of issues I could use your input on:
- Some C++ tests are failing in `hdfs_test`, with the following error:

  ```
  /arrow/cpp/src/arrow/filesystem/hdfs_test.cc:90: HadoopFileSystem::Make failed, it is possible when we don't have proper driver on this node, err msg is IOError: Unable to load libjvm
  ```

  I'm not sure how best to resolve this; any guidance would be appreciated.
- The MSVC compiler is complaining about a forward-declared friend function I'm using in `hdfs.cc`. Do you have any advice on how to better organise this?
The Python and MATLAB test failures are not related. Thanks in advance!
Hi @AlenkaF
> - Some C++ tests are failing in `hdfs_test`, with the following error:
I think you're misreading the output. The test is actually skipped when the driver fails loading, which is normal:
```
(...)
dlopen(/usr/java/latest//lib/amd64/server/libjvm.so) failed: /usr/java/latest//lib/amd64/server/libjvm.so: cannot open shared object file: No such file or directory
/arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:294 try_dlopen(libjvm_potential_paths, "libjvm")
/arrow/cpp/src/arrow/filesystem/hdfs.cc:95 ConnectLibHdfs(&driver_)
/arrow/cpp/src/arrow/filesystem/hdfs.cc:725 ptr->impl_->Init()
/arrow/cpp/src/arrow/filesystem/hdfs_test.cc:205: Skipped
Driver not loaded, skipping
```
https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3277
The problem is in the other tests, because it seems a destructor crashes:
```
[ RUN ] TestHadoopFileSystem.DeleteDirContents
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Running '/build/cpp/debug/arrow-hdfs-test' produced core dump at '/tmp/core.arrow-hdfs-test.25630', printing backtrace:
[New LWP 25630]
[New LWP 25631]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/build/cpp/debug/arrow-hdfs-test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fc37b23d5be in arrow::io::internal::LibHdfsShim::Disconnect () at /arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:340
340 int LibHdfsShim::Disconnect(hdfsFS fs) { return this->hdfsDisconnect(fs); }
[Current thread is 1 (Thread 0x7fc374633dc0 (LWP 25630))]
(...)
```
https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3281
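One plausible cause of a segfault at that line is calling through a shim function pointer that was never resolved (or was already torn down) when the destructor runs. The stand-in below is purely illustrative, with hypothetical names, not the actual Arrow code; it only shows the guard pattern that avoids crashing when the driver never loaded:

```cpp
#include <cstddef>

// Hypothetical stand-in for LibHdfsShim: a table of function pointers that
// dlopen/dlsym fill in at runtime. If loading fails, the pointers stay null,
// and calling through a null pointer segfaults, as in the backtrace above.
struct FakeShim {
  int (*disconnect)(void* fs) = nullptr;  // null until the driver is loaded

  int Disconnect(void* fs) {
    if (disconnect == nullptr) return -1;  // fail safely instead of crashing
    return disconnect(fs);
  }
};

// Simulates a successfully resolved driver symbol.
inline int FakeDisconnect(void*) { return 0; }
```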
> - The MSVC compiler is complaining about a forward-declared friend function I'm using in `hdfs.cc`. Do you have any advice on how to better organise this?
Hmm, rather than trying to find the exact explanation, a simple solution would be to change these functions into static methods. For example, this:
```cpp
ARROW_EXPORT Status MakeReadableFile(const std::string& path, int32_t buffer_size,
                                     const io::IOContext& io_context, LibHdfsShim* driver,
                                     hdfsFS fs, hdfsFile file,
                                     std::shared_ptr<HdfsReadableFile>* out);
```
would become:
```cpp
class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile {
 public:
  (...)

  static Result<std::shared_ptr<HdfsReadableFile>> Make(
      const std::string& path, int32_t buffer_size,
      const io::IOContext& io_context, LibHdfsShim* driver,
      hdfsFS fs, hdfsFile file);
};
```
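The refactoring suggested here, replacing a free friend function with an out-parameter by a static factory returning a `Result`, can be sketched with a self-contained toy. `MiniResult` and `ToyReadableFile` are stand-ins invented for this example, not the actual Arrow declarations:

```cpp
#include <memory>
#include <string>
#include <utility>

// Minimal stand-in for arrow::Result: a value plus an ok flag.
template <typename T>
struct MiniResult {
  T value{};
  bool ok = false;
};

class ToyReadableFile {
 public:
  // Static factory: construction lives inside the class, so no
  // forward-declared friend function is needed (which is what upset MSVC),
  // and errors travel in the return value instead of an out-parameter.
  static MiniResult<std::shared_ptr<ToyReadableFile>> Make(std::string path) {
    if (path.empty()) return {};  // error case: ok stays false
    return {std::shared_ptr<ToyReadableFile>(new ToyReadableFile(std::move(path))),
            true};
  }

  const std::string& path() const { return path_; }

 private:
  explicit ToyReadableFile(std::string path) : path_(std::move(path)) {}
  std::string path_;
};
```

Because the constructor is private and only `Make` can reach it, all validation is funneled through one place, which is the same design the suggested `HdfsReadableFile::Make` would follow.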
Aha, I see! Thanks, will look into it.
@pitrou I cleaned up the CI failures (the others are not related) and am hoping these changes will not be too bad to review :)
@benibus @pitrou would you mind taking another look?
@pitrou gentle ping. Would I be too optimistic to try to get it into 21.0.0?