odc-tools
Discussion: improvements/replacements for s3-find
Introduction
s3-find is a library function and a CLI utility used for listing the contents of S3 buckets with some basic "globbing" support. It's an important tool for keeping various databases in sync with S3 buckets and for data investigations. But it has some serious issues and performance pitfalls.
Problems
The main problem is the dependency on aiobotocore (this can be "fixed" by moving away from the async model and into a threaded model). There are also some limitations in the way globbing works.
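To make the threaded option concrete, here is a rough sketch of what a thread-pool based listing could look like. This is not the current s3-find code: `threaded_find`, `list_dir` and `fake_list_dir` are hypothetical names, and the S3 call is replaced by an in-memory stand-in so the snippet runs without credentials. In a real version `list_dir` would presumably wrap a paginated boto3 `list_objects_v2` call with `Delimiter="/"`, which is an assumption and not something tested here.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from fnmatch import fnmatch


def threaded_find(list_dir, root, pattern="*", max_workers=8):
    """Walk a prefix tree with a thread pool instead of asyncio.

    list_dir(prefix) is a hypothetical callable returning
    (sub_prefixes, keys) for a single "directory" level.
    Returns the sorted keys whose file name matches `pattern`.
    """
    matched = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(list_dir, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                sub_prefixes, keys = fut.result()
                # Match on the last path component only
                matched.extend(
                    k for k in keys if fnmatch(k.rsplit("/", 1)[-1], pattern)
                )
                # Fan out: each sub-prefix is listed by its own task
                pending |= {pool.submit(list_dir, p) for p in sub_prefixes}
    return sorted(matched)


# In-memory stand-in for a bucket, for illustration only
FAKE_KEYS = [
    "a/x.yaml",
    "a/b/y.yaml",
    "a/b/z.tif",
    "a/c/d/w.yaml",
]


def fake_list_dir(prefix):
    """Mimic a delimiter-based listing of one level under `prefix`."""
    subs, keys = set(), []
    for k in FAKE_KEYS:
        if not k.startswith(prefix):
            continue
        rest = k[len(prefix):]
        if "/" in rest:
            subs.add(prefix + rest.split("/", 1)[0] + "/")
        else:
            keys.append(k)
    return sorted(subs), keys


print(threaded_find(fake_list_dir, "a/", "*.yaml"))
```

Listing is I/O bound, so plain threads should be enough to get concurrency here, and this would only need the regular synchronous client rather than aiobotocore.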
Actions
Let's discuss what we want to do about this tool and evaluate alternatives like s5cmd, minio/mc, etc.
Refs #167 #30 #149 #105
s4cmd (https://github.com/bloomreach/s4cmd) is a pure Python implementation built for performance.
I agree that we should try to remove the aiobotocore dependency.
Ref #167: it can do //**/ fine, but s3-to-dc would be better with more informative messages when it can't deal with certain patterns, rather than a general message saying Added 0 datasets and failed 0 datasets.