aws-sdk-pandas
`chunked=True` in `wr.s3.read_parquet` is not working as expected
Describe the bug
Hi, I used `wr.s3.read_parquet` to load a Parquet file from S3. The file is very large: a single file has 1 million rows, which I can't fit in memory, so I tried `chunked=True`, but memory still fills up. I also tried `chunked=10000` and hit the same issue. I have 32 GB of RAM and was still unable to load the Parquet file. If I'm not wrong, `chunked=True` returns an iterator and does not load the whole Parquet file into memory, but that is what is happening in my case.
```python
import awswrangler as wr
import gc

for chunk in wr.s3.read_parquet(s3_path, chunked=True):
    print(chunk.shape)
    # ...my logic
    del chunk
    gc.collect()
```
It fails on the line `for chunk in wr.s3.read_parquet(s3_path, chunked=True):`.
What am I missing here?
How to Reproduce
Expected behavior
No response
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.10
AWS SDK for pandas version
awswrangler version 3.3.0
Additional context
No response
`chunked=True` should indeed be more memory-friendly. Can you be more specific about what is failing? Is there a specific exception thrown?
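To help narrow it down, here is a minimal sketch (the bucket path is a placeholder and `chunked=10_000` is an arbitrary size) of how the chunked read is expected to behave as a lazy generator:

```python
import awswrangler as wr

s3_path = "s3://my-bucket/my-file.parquet"  # hypothetical path

# read_parquet with chunked= returns a generator of DataFrames, so resident
# memory should stay near the size of one chunk. This loop inspects each
# chunk and stops early, which skips fetching the remaining data.
for i, chunk in enumerate(wr.s3.read_parquet(s3_path, chunked=10_000)):
    mb = chunk.memory_usage(deep=True).sum() / 1e6
    print(f"chunk {i}: shape={chunk.shape}, ~{mb:.1f} MB")
    if i == 4:
        break
```

One thing worth ruling out (an assumption on my part, not something confirmed here): Parquet is read row group by row group, so a file written as a single very large row group can still spike memory even when iterating in chunks.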
An alternative is to use S3 Select if you are able to filter the Parquet file to specific columns/rows. The filtering would then be done server-side instead of client-side.
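Something along these lines should work, assuming a version where `wr.s3.select_query` is available; the path, column names, and predicate below are invented for illustration:

```python
import awswrangler as wr

# Illustrative only: bucket, columns, and the WHERE clause are hypothetical.
# The projection and filter run server-side via S3 Select, so only the
# matching rows/columns are transferred to the client.
df = wr.s3.select_query(
    sql="SELECT s.col_a, s.col_b FROM s3object s WHERE s.col_c > 100",
    path="s3://my-bucket/my-file.parquet",
    input_serialization="Parquet",
    input_serialization_params={},  # no extra options needed for Parquet
)
print(df.shape)
```

If S3 Select is not an option, passing `columns=` to `read_parquet` at least limits which columns are materialized per chunk.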