hadoop-connectors
Implementation of batch listFiles APIs for GoogleHadoopFileSystem
There are APIs defined in the org.apache.hadoop.fs.FileSystem abstraction that list files and return an iterator:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listFiles-org.apache.hadoop.fs.Path-boolean-
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listStatusIterator-org.apache.hadoop.fs.Path-
The iterator is an on-demand iterator: instead of listing all the objects under a path in one go, it does the listing in batches. When a batch has been consumed, a call is made to fetch the next batch. This is very useful if you are listing a container with a very large number of files, since it bounds the amount of heap consumed, and the initial call also completes quickly (it fetches only one batch) no matter how many files the container holds.
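To make the on-demand batching concrete, here is a minimal, self-contained sketch of the pattern in plain Java. It does not use the actual connector or Hadoop classes; `Page`, `BatchedIterator`, and the token-based `fetchPage` callback are all hypothetical stand-ins for a paginated backend listing call (GCS-style page tokens are assumed purely for illustration).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

public class BatchedListingDemo {

  /** One page of results plus the token needed to fetch the next page (null = last page). */
  static final class Page<T> {
    final List<T> items;
    final String nextToken;
    Page(List<T> items, String nextToken) {
      this.items = items;
      this.nextToken = nextToken;
    }
  }

  /** Iterator that only requests the next page once the current one is fully consumed. */
  static final class BatchedIterator<T> implements Iterator<T> {
    private final Function<String, Page<T>> fetchPage; // hypothetical backend call
    private Iterator<T> current = Collections.emptyIterator();
    private String token = ""; // "" requests the first page; null means exhausted

    BatchedIterator(Function<String, Page<T>> fetchPage) {
      this.fetchPage = fetchPage;
    }

    @Override
    public boolean hasNext() {
      // Fetch pages lazily until we find an item or run out of pages.
      while (!current.hasNext() && token != null) {
        Page<T> page = fetchPage.apply(token); // one backend round trip per batch
        current = page.items.iterator();
        token = page.nextToken;
      }
      return current.hasNext();
    }

    @Override
    public T next() {
      if (!hasNext()) throw new NoSuchElementException();
      return current.next();
    }
  }

  /** Simulates a backend holding `total` files, served `pageSize` objects at a time. */
  static Iterator<String> demoListing(int total, int pageSize) {
    return new BatchedIterator<>(token -> {
      int start = token.isEmpty() ? 0 : Integer.parseInt(token);
      int end = Math.min(start + pageSize, total);
      List<String> items = new ArrayList<>();
      for (int i = start; i < end; i++) items.add("file-" + i);
      String next = end < total ? Integer.toString(end) : null;
      return new Page<>(items, next);
    });
  }

  public static void main(String[] args) {
    // 10 files served in pages of 4 -> only 3 backend calls, never all in heap at once.
    Iterator<String> it = demoListing(10, 4);
    while (it.hasNext()) System.out.println(it.next());
  }
}
```

The key property is that heap usage is bounded by one page rather than the full listing, which is why this pattern matters for containers with hundreds of thousands of objects.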
We at Dremio have a use case where we list containers with a very large number of files (on the order of 100K), and therefore wanted to implement the batch listing functionality in GoogleHadoopFileSystem. I would like to know the process for submitting a proposal to implement these APIs and contribute them here, and to open a discussion on how the release process works for this repo.
Tagging relevant people here. Let me know if this discussion should happen over mail or in another suitable forum; I couldn't find a Jira for this project. @mprashanthsagar @medb @pzampino
@davidrabinowitz
@mayanks
Thanks for this. Implementing better listFiles() is on our roadmap.
@davidrabinowitz, for the Dremio use case I have raised the PR https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/585. Is it possible for you to review it?