aws-sdk-pandas
wr.s3.download fits the whole file into memory, with 2x memory allocation
Describe the bug
I was using wr.s3.download on a 2 GiB memory VM and noticed that downloading a 1006 MiB GZIP file from S3 allocates ~2295 MiB, both with and without the use_threads parameter. This was measured using this memory profiler.
Obviously my script fails with an OOM error on a 2 GiB memory machine with 2 CPUs. dmesg gives a slightly different memory estimate:
$ dmesg | tail -1
Out of memory: Killed process 10020 (python3) total-vm:2573584kB, anon-rss:1644684kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:3844kB oom_score_adj:0
It turns out that wr.s3.download by default uses botocore's s3.get_object and fits the whole response into memory:
https://github.com/aws/aws-sdk-pandas/blob/7e83b89e96af33ff6eb91f6801d8b66dcd98d4f2/awswrangler/s3/_fs.py#L65-L75
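For illustration, here is a minimal sketch of what such a one-shot fetch amounts to (bucket, key, and file names are placeholders; this is not the library's exact code):

import boto3

s3_client = boto3.client("s3")

# read() pulls the entire object body into a single bytes buffer, so a ~1 GiB
# object costs ~1 GiB of RAM before a single byte reaches disk
body = s3_client.get_object(Bucket="my-bucket", Key="big-file.gz")["Body"].read()
with open("big-file.gz", "wb") as f:
    f.write(body)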
Is it possible to chunkify reading of botocore response in awswrangler to be more memory efficient?
For instance, using the following snippet I got my file without any issues on the same machine:
# `s3` is a boto3 S3 client; `kwargs` holds the Bucket/Key of the object
raw_stream = s3.get_object(**kwargs)["Body"]
with open("test_botocore_iter_chunks.gz", 'wb') as f:
    for chunk in iter(lambda: raw_stream.read(64 * 1024), b''):
        f.write(chunk)
I tried the wr.config.s3_block_size parameter, expecting it to chunk the response, but it does not help. Even after setting s3_block_size to a value smaller than the file size, you fall into this if condition:
https://github.com/aws/aws-sdk-pandas/blob/7e83b89e96af33ff6eb91f6801d8b66dcd98d4f2/awswrangler/s3/_fs.py#L326
which just fits the whole response into memory.
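For reference, this is roughly how I configured it (the 8 MiB block size and the S3 path are just example values):

import awswrangler as wr

# Block size well below the ~1 GiB object size; the download path still
# ends up buffering the whole response in memory
wr.config.s3_block_size = 8 * 1024 * 1024
wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="big-file.gz")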
How to Reproduce
Use a memory profiler on:
wr.s3.download(path, local_file)
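A minimal reproduction sketch using memory_profiler's @profile decorator (the S3 path and local file name are placeholders for the ~1 GiB GZIP object):

import awswrangler as wr
from memory_profiler import profile

@profile  # run with: python -m memory_profiler repro.py
def download():
    wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="big-file.gz")

if __name__ == "__main__":
    download()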
Expected behavior
Please let me know if it's already possible to read the response in chunks.
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.6.9 -- this is old, but I can double-check on newer versions
AWS SDK for pandas version
2.14.0
Additional context
No response