
Listing contents of large S3 folders is slow

Open yoel-ross-zip opened this issue 2 years ago • 6 comments

Hey,

Thanks for your work on this library. I've been using it for a while and it's really nice.

Recently I ran into some issues with long load times for large S3 folders. I believe this is the result of repeated synchronous calls to the abstract lstat method. I have done some testing and found that making these calls with asyncio, using the s3fs._info method instead, really speeds things up (roughly 20x faster on large folders).
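Roughly, the idea is to gather the metadata lookups concurrently instead of one blocking call per key. A minimal sketch, assuming s3fs's asynchronous filesystem and its private _info coroutine (exact names and signatures may differ between s3fs versions):

import asyncio
import s3fs

async def fetch_infos(paths):
    # Async S3 filesystem; _info is the coroutine behind the blocking info()/lstat-style calls
    fs = s3fs.S3FileSystem(asynchronous=True)
    session = await fs.set_session()
    try:
        # Fire all metadata lookups concurrently rather than one synchronous call per object
        return await asyncio.gather(*(fs._info(p) for p in paths))
    finally:
        await session.close()

# e.g. infos = asyncio.run(fetch_infos(["my-bucket/notebooks/a.ipynb", "my-bucket/notebooks/b.ipynb"]))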

I'm currently using a fork I made with these changes, and it works great. I opened a PR for you to consider: https://github.com/danielfrg/s3contents/pull/139

I use this library quite a bit, and would be happy to put in the work to get this change merged.

Thanks again!

Joe

yoel-ross-zip avatar Mar 20 '22 15:03 yoel-ross-zip

Fixed thanks to your PR :) Thanks!

danielfrg avatar Mar 23 '22 12:03 danielfrg

@ziprjoe @danielfrg First of all, many thanks for your precious work! 😄 I've just installed this new modified version, because I noticed the same problem working with large directories. Sadly, I'm facing an error. It seems that the file .s3keep is present in the bucket only at the highest level, but not in the subdirectories where it is also searched. Any suggestions?

[screenshot of the error traceback]

aleny91 avatar Jun 28 '22 12:06 aleny91

Hey, this should just be a matter of catching the exception and ignoring it. In cases where there is no .s3keep file there isn't a way to show the last update time, so a dummy date will be displayed. This PR should fix it: https://github.com/danielfrg/s3contents/pull/143
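Roughly the idea, as an illustrative sketch only (not the exact code in the PR; safe_lstat and the dummy date value are made-up names here):

from datetime import datetime, timezone

# Hypothetical fallback timestamp shown when a directory has no .s3keep placeholder
DUMMY_DATE = datetime(1970, 1, 1, tzinfo=timezone.utc)

def safe_lstat(fs, path):
    try:
        return fs.lstat(path)  # normally returns something like {"ST_MTIME": <datetime>}
    except FileNotFoundError:
        # No .s3keep under this prefix: ignore the error and fall back to a dummy date
        return {"ST_MTIME": DUMMY_DATE}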

yoel-ross-zip avatar Jun 29 '22 12:06 yoel-ross-zip

@ziprjoe @danielfrg Firstly, I'd like to express my gratitude for your excellent work on this library. It has been incredibly useful for my use case of connecting S3 with JupyterHub compared to the alternatives.

However, I've encountered an issue when using s3contents to connect to an S3 bucket with pre-existing directories. These directories aren't displayed in the UI unless I manually add a .s3keep file to each directory. Once I do this, the issue is resolved. I'm wondering if you are aware of the cause of this problem and if there's a way to use s3contents with a bucket that has pre-existing directories without having to manually add .s3keep files to each directory.

Thank you for your time and attention!

fakhavan avatar Apr 24 '23 06:04 fakhavan

Hi @ziprjoe.

I think there are new ways to handle directories in S3 that do not require the placeholder files. I have not tested them, and to be honest I am not using this lib anymore.

I try to keep it updated, but since I am not using it, it is behind on needed features and I don't expect I will be able to add new features in the near future. I basically just handle new releases from contributors at this point.

danielfrg avatar Apr 24 '23 14:04 danielfrg

> However, I've encountered an issue when using s3contents to connect to an S3 bucket with pre-existing directories. These directories aren't displayed in the UI unless I manually add a .s3keep file to each directory. [...] Is there a way to use s3contents with a bucket that has pre-existing directories without having to manually add .s3keep files to each directory?

I handle that with a script called in the postStart lifecycle hook:

file=$HOME/.dir.txt
# Save the S3 directory tree (object keys start at column 32 of `aws s3 ls --recursive` output)
aws s3 ls --recursive s3://<bucket> | cut -c32- | xargs -d '\n' -n 1 dirname | sort -u > "$file"
# Local placeholder file to upload into every directory
touch .s3keep

while IFS= read -r folder; do
    # Skip keys that sit at the bucket root (dirname prints ".")
    [ "$folder" = "." ] && continue
    aws s3 cp .s3keep "s3://<bucket>/$folder/.s3keep"
done < "$file"

fbaldo31 avatar May 16 '24 11:05 fbaldo31