
chunked = True in wr.s3.read_parquet is not working as expected

Open PankajJ08 opened this issue 1 year ago • 1 comment

Describe the bug

Hi, I've used wr.s3.read_parquet to load a Parquet file from S3. The file is very large: a single file has 1 million rows, which I can't fit in memory, so I tried chunked=True, but memory still fills up. I also tried chunked=10000 with the same result. I have 32 GB of RAM and was still unable to load the Parquet file. If I'm not wrong, chunked=True should return an iterator rather than loading the whole file into memory, but that's not what's happening in my case.

```python
import awswrangler as wr
import gc

for chunk in wr.s3.read_parquet(s3_path, chunked=True):
    print(chunk.shape)
    # ...my logic
    del chunk
    gc.collect()
```

It fails on the line `for chunk in wr.s3.read_parquet(s3_path, chunked=True):`. What am I missing here?

How to Reproduce

No response

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.10

AWS SDK for pandas version

awswrangler version 3.3.0

Additional context

No response

PankajJ08 avatar Feb 22 '24 07:02 PankajJ08

chunked=True should indeed be more memory-friendly. Can you be more specific about what is failing? Is there a specific exception thrown?

An alternative is to use S3 Select if you are able to filter the Parquet file to specific columns/rows. The filtering would then be done server-side instead of client-side.

jaidisido avatar Feb 22 '24 09:02 jaidisido