odc-tools

Discussion: improvements/replacements for s3-find

Open Kirill888 opened this issue 3 years ago • 3 comments

Introduction

s3-find is a library function and a CLI utility used for listing S3 buckets with some basic "globbing" support. It's an important tool used for keeping various databases in sync with S3 buckets and also for data investigations. But there are some serious issues and performance pitfalls.

Problems

The main problem is the dependency on aiobotocore (can be "fixed" by moving away from the async model and into a threaded model). There are also some limitations in the way globbing works.
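To make the "threaded model" idea concrete, here is a minimal sketch of a thread-pool based prefix walk with glob filtering. It is not the s3-find implementation: `threaded_find` and the `list_level` callable are hypothetical names invented for illustration. `list_level` is assumed to return the sub-prefixes and object keys directly under one prefix, which in practice could wrap boto3's `list_objects_v2` with `Delimiter="/"`. Matching uses the standard library `fnmatch`, whose `*` also matches `/`, so it only approximates whatever glob rules the real tool applies.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatchcase
from typing import Callable, List, Tuple

# Hypothetical interface: given a prefix, return (sub_prefixes, object_keys)
# for a single "directory" level, as a delimiter-based S3 listing would.
ListFn = Callable[[str], Tuple[List[str], List[str]]]


def threaded_find(
    list_level: ListFn, root: str, pattern: str, max_workers: int = 8
) -> List[str]:
    """Walk prefixes level by level, listing each level's prefixes in threads."""
    matches: List[str] = []
    frontier = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            next_frontier: List[str] = []
            # pool.map runs the listings concurrently and yields results in order
            for sub_prefixes, keys in pool.map(list_level, frontier):
                next_frontier.extend(sub_prefixes)
                # Note: fnmatch's "*" also matches "/" in full keys
                matches.extend(k for k in keys if fnmatchcase(k, pattern))
            frontier = next_frontier
    return sorted(matches)


# In-memory stand-in for a bucket, used here instead of real S3 calls
fake_bucket = {
    "": (["a/", "b/"], ["top.yaml"]),
    "a/": (["a/x/"], ["a/1.yaml", "a/1.tif"]),
    "a/x/": ([], ["a/x/2.yaml"]),
    "b/": ([], ["b/3.json"]),
}

found = threaded_find(lambda prefix: fake_bucket[prefix], "", "*.yaml")
```

Because the lister is passed in, the walk itself needs no async runtime, and the same function can be exercised against a fake bucket as above or against a real client.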

Actions

Let's discuss what we want to do about this tool, evaluate alternatives like s5cmd, minio/mc etc.

Kirill888, Aug 25 '21 07:08

Refs #167 #30 #149 #105

Kirill888, Aug 25 '21 07:08

This is a pure Python implementation built for performance: https://github.com/bloomreach/s4cmd

I agree that we should try to remove the aiobotocore dependency.

alexgleith, Aug 25 '21 07:08

Ref #167: it can handle //**/ fine, but s3-to-dc would be better with more informative messages when it can't deal with certain patterns, rather than a general message saying "Added 0 datasets and failed 0 datasets".

emmaai, Nov 22 '21 02:11