aws-sdk-pandas
wr.s3.download fits the whole file into memory, with 2x memory allocation
Describe the bug
I was using wr.s3.download on a 2 GiB memory VM and noticed that downloading a 1006 MiB GZIP file from S3 allocates ~2295 MiB, both with and without the use_threads parameter. This was measured using this memory profiler.
Obviously my script fails with an OOM error on a 2 GiB memory machine with 2 CPUs. dmesg gives a slightly different memory estimate:
$ dmesg | tail -1
Out of memory: Killed process 10020 (python3) total-vm:2573584kB, anon-rss:1644684kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:3844kB oom_score_adj:0
It turns out that wr.s3.download by default uses botocore's s3.get_object and fits the whole response into memory:
https://github.com/aws/aws-sdk-pandas/blob/7e83b89e96af33ff6eb91f6801d8b66dcd98d4f2/awswrangler/s3/_fs.py#L65-L75
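For illustration, here is a minimal sketch of what such a one-shot fetch amounts to (bucket, key, and file names are placeholders; this is not the library's exact code):

import boto3

s3_client = boto3.client("s3")

# read() pulls the entire object body into a single bytes buffer, so a ~1 GiB
# object costs ~1 GiB of RAM before a single byte reaches disk
body = s3_client.get_object(Bucket="my-bucket", Key="big-file.gz")["Body"].read()
with open("big-file.gz", "wb") as f:
    f.write(body)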
Is it possible to chunkify reading of botocore response in awswrangler to be more memory efficient?
For instance, using the following snippet I got my file without any issues on the same machine:
# `s3` is a boto3 S3 client; `kwargs` holds the Bucket/Key of the object
raw_stream = s3.get_object(**kwargs)["Body"]
with open("test_botocore_iter_chunks.gz", 'wb') as f:
    for chunk in iter(lambda: raw_stream.read(64 * 1024), b''):
        f.write(chunk)
I tried the wr.config.s3_block_size parameter, expecting it to chunk the response, but it does not help. Even after setting s3_block_size to a value smaller than the file size, you fall into this if condition:
https://github.com/aws/aws-sdk-pandas/blob/7e83b89e96af33ff6eb91f6801d8b66dcd98d4f2/awswrangler/s3/_fs.py#L326
which just fits the whole response into memory.
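For reference, this is roughly how I configured it (the 8 MiB block size and the S3 path are just example values):

import awswrangler as wr

# Block size well below the ~1 GiB object size; the download path still
# ends up buffering the whole response in memory
wr.config.s3_block_size = 8 * 1024 * 1024
wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="big-file.gz")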
How to Reproduce
Use a memory profiler on:
wr.s3.download(path, local_file)
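A minimal reproduction sketch using memory_profiler's @profile decorator (the S3 path and local file name are placeholders for the ~1 GiB GZIP object):

import awswrangler as wr
from memory_profiler import profile

@profile  # run with: python -m memory_profiler repro.py
def download():
    wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="big-file.gz")

if __name__ == "__main__":
    download()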
Expected behavior
Please let me know if it's already possible to read the response in chunks.
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.6.9 -- this is old, but I can double-check on newer versions
AWS SDK for pandas version
2.14.0
Additional context
No response