arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code

Open AlenkaF opened this issue 8 months ago • 11 comments

Rationale for this change

ObjectType and FileStatistics in io/hdfs.h have been deprecated for a while and can be removed.

What changes are included in this PR?

ObjectType and FileStatistics structs are removed and instead FileSystem API in arrow::fs is used. Together with this change, the hdfs connected code is moved from cpp/src/arrow/io to cpp/src/arrow/filesystem merging FileSystem and HadoopFileSystem classes from arrow::io into the public HadoopFileSystem class.

Are these changes tested?

Existing tests should pass.

Are there any user-facing changes?

Deprecated structs are removed and all hdfs related code is now a part of the filesystem module.

  • GitHub Issue: #45747

Also closes: https://github.com/apache/arrow/issues/22457 (not sure about io/interfaces.h?)

AlenkaF avatar Apr 01 '25 10:04 AlenkaF

Ah, this will not work. If we want to remove ObjectType and FileStatistics from io/hdfs.h and instead use FileSystem API, then filesystem component would have to be enabled by default. Also, if I understand correctly, we want to do a refactoring of io/hdfs anyways which would include the changes in this PR? (https://github.com/apache/arrow/issues/22457)

cc @pitrou

AlenkaF avatar Apr 01 '25 12:04 AlenkaF

Ah, this will not work. If we want to remove ObjectType and FileStatistics from io/hdfs.h and instead use FileSystem API, then filesystem component would have to be enabled by default.

Well, ARROW_HDFS=ON could imply ARROW_FILESYSTEM=ON. I don't think that's a problem.

Also, if I understand correctly, we want to do a refactoring of io/hdfs anyways which would include the changes in this PR?

Yes, indeed. io/hdfs could be moved to filesystem/hdfs_internal or something similar.

pitrou avatar Apr 02 '25 10:04 pitrou

OK, I will then move io/hdfs to filesystem/hdfs_internal. Thanks!

AlenkaF avatar Apr 02 '25 11:04 AlenkaF

@pitrou I think this is ready for review. The failing builds have an issue opened: https://github.com/apache/arrow/issues/46077

AlenkaF avatar Apr 09 '25 11:04 AlenkaF

Hi @pitrou, could you please take a quick look at the changes when you have a moment?

I've done my best to implement the suggested changes, but am sure there's still room for improvement. A couple of issues I could use your input on:

  • Some C++ tests are failing in hdfs_test, with the following error:

    /arrow/cpp/src/arrow/filesystem/hdfs_test.cc:90: HadoopFileSystem::Make failed, it is possible when we don't have proper 
    driver on this node, err msg is IOError: Unable to load libjvm
    

    I'm not sure how best to resolve this — any guidance would be appreciated.

  • The MSVC compiler is complaining about a forward-declared friend function I'm using in hdfs.cc. Do you have any advice on how to better organise this?

The Python and MATLAB test failures are not related. Thanks in advance!

AlenkaF avatar May 19 '25 09:05 AlenkaF

Hi @AlenkaF

  • Some C++ tests are failing in hdfs_test, with the following error:

I think you're misreading the output, the test is actually skipped when the driver fails unloading, which is normal:

(...)
dlopen(/usr/java/latest//lib/amd64/server/libjvm.so) failed: /usr/java/latest//lib/amd64/server/libjvm.so: cannot open shared object file: No such file or directory
/arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:294  try_dlopen(libjvm_potential_paths, "libjvm")
/arrow/cpp/src/arrow/filesystem/hdfs.cc:95  ConnectLibHdfs(&driver_)
/arrow/cpp/src/arrow/filesystem/hdfs.cc:725  ptr->impl_->Init()
/arrow/cpp/src/arrow/filesystem/hdfs_test.cc:205: Skipped
Driver not loaded, skipping

https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3277

The problem is in the other tests, because it seems a destructor crashes:


[ RUN      ] TestHadoopFileSystem.DeleteDirContents
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Running '/build/cpp/debug/arrow-hdfs-test' produced core dump at '/tmp/core.arrow-hdfs-test.25630', printing backtrace:
[New LWP 25630]
[New LWP 25631]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/build/cpp/debug/arrow-hdfs-test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc37b23d5be in arrow::io::internal::LibHdfsShim::Disconnect () at /arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:340
340	int LibHdfsShim::Disconnect(hdfsFS fs) { return this->hdfsDisconnect(fs); }
[Current thread is 1 (Thread 0x7fc374633dc0 (LWP 25630))]
(...)

https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3281

pitrou avatar May 19 '25 12:05 pitrou

  • The MSVC compiler is complaining about a forward-declared friend function I'm using in hdfs.cc. Do you have any advice on how to better organise this?

Hmm, rather than trying to find the exact explanation, a simple solution would be to change these functions into static methods, for example this:

ARROW_EXPORT Status MakeReadableFile(const std::string& path, int32_t buffer_size,
  const io::IOContext& io_context, LibHdfsShim* driver,
  hdfsFS fs, hdfsFile file,
  std::shared_ptr<HdfsReadableFile>* out);

would become:

class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile {
 public:
   (...)

  static Result<std::shared_ptr<HdfsReadableFile>> Make(
      const std::string& path, int32_t buffer_size,
      const io::IOContext& io_context, LibHdfsShim* driver,
      hdfsFS fs, hdfsFile file);

pitrou avatar May 19 '25 12:05 pitrou

Aha, I see! Thanks, will look into it.

AlenkaF avatar May 19 '25 12:05 AlenkaF

@pitrou I cleaned up the CI failures (others are not related) and am hoping this changes will not be too bad to review :)

AlenkaF avatar May 20 '25 13:05 AlenkaF

@benibus @pitrou would you mind taking another look?

AlenkaF avatar Jun 10 '25 11:06 AlenkaF

@pitrou gentle ping. Would I be too optimistic to try to get it into 21.0.0?

AlenkaF avatar Jun 23 '25 13:06 AlenkaF