
Orc hdfs

Open 372046933 opened this issue 3 years ago • 1 comments

Related tasks: https://github.com/tensorflow/io/issues/1372#issuecomment-1072007873

This PR uses the TensorFlow Filesystem API to access HDFS instead of relying on libhdfspp, which is not included in the current compilation setup. By the way, libhdfspp is not another wrapper around the C libhdfs; it is a separate implementation based on the RPC protocol, which is quite complex, and some of its code does not seem well maintained. IMHO, we can rely on TensorFlow's modular Filesystem HDFS API instead, which is based on libhdfs and is quite stable.

libtensorflow_io_plugins.so is loaded when import tensorflow_io is executed in Python, so the following C++ code

std::unique_ptr<tensorflow::RandomAccessFile> file_;
tensorflow::Env::Default()->NewRandomAccessFile("hdfs:///xxx/yyy/z", &file_);

succeeds and yields a usable RandomAccessFile. In this way, we can support reading ORC files from HDFS.
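To illustrate how a positional-read file handle could be bridged to an ORC-style input stream, here is a minimal sketch. It deliberately uses stand-in types so that it compiles without TensorFlow, ORC, or HDFS: `PositionalFile`, `InMemoryFile`, and `OrcStyleInputStream` are hypothetical names, and their method signatures are assumptions that only loosely mirror the shape of a random-access file and of a `getLength()` / `read(buf, length, offset)` stream. It is not the code in this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a random-access file handle. It mirrors only the
// general idea of positional reads, not the real TensorFlow signature.
class PositionalFile {
 public:
  virtual ~PositionalFile() = default;
  // Copies up to n bytes starting at offset into out; returns bytes copied.
  virtual std::size_t ReadAt(std::uint64_t offset, std::size_t n,
                             char* out) const = 0;
  virtual std::uint64_t Size() const = 0;
};

// In-memory implementation, present only so the sketch runs without HDFS.
class InMemoryFile : public PositionalFile {
 public:
  explicit InMemoryFile(std::string data) : data_(std::move(data)) {}

  std::size_t ReadAt(std::uint64_t offset, std::size_t n,
                     char* out) const override {
    if (offset >= data_.size()) return 0;
    const std::size_t available =
        data_.size() - static_cast<std::size_t>(offset);
    const std::size_t count = std::min(n, available);
    std::memcpy(out, data_.data() + offset, count);
    return count;
  }

  std::uint64_t Size() const override { return data_.size(); }

 private:
  std::string data_;
};

// Hypothetical adapter exposing an ORC-style stream interface on top of any
// PositionalFile. A short read is reported as an exception.
class OrcStyleInputStream {
 public:
  explicit OrcStyleInputStream(std::unique_ptr<PositionalFile> file)
      : file_(std::move(file)) {}

  std::uint64_t getLength() const { return file_->Size(); }

  void read(void* buf, std::uint64_t length, std::uint64_t offset) const {
    const std::size_t got = file_->ReadAt(
        offset, static_cast<std::size_t>(length), static_cast<char*>(buf));
    if (got != length) throw std::runtime_error("short read");
  }

 private:
  std::unique_ptr<PositionalFile> file_;
};
```

In a real integration, the in-memory class would presumably be replaced by a thin wrapper around the file handle obtained from the filesystem API, so the ORC reader never needs to know that the bytes come from HDFS.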

372046933, Apr 27 '22 12:04

By the way, Kerberos support is provided by libhdfs, libgssapi-krb5-2, etc., which must be installed in the environment. I have tested libhdfspp and found that it does not support Kerberos.

372046933, Apr 27 '22 13:04