
wr.s3.read_parquet() errors if any files are missing

Open dhorkel opened this issue 2 years ago • 6 comments

Is your idea related to a problem? Please describe.

Currently when using wr.s3.read_parquet(path=list_of_paths) with a list of file paths, if any individual file does not exist, the following error is raised:

botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

As a workaround I've used wr.s3.does_object_exist(), but this can be slow when checking many files since it only accepts a single path.

Describe the solution you'd like

I would like wr.s3.read_parquet() to have an optional argument, e.g. error_missing_files=, letting the user choose whether missing files should raise an error, emit a warning, or simply be ignored.
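
As a sketch of the proposed call (error_missing_files is hypothetical; it does not exist in the library today):

```python
import awswrangler as wr

list_of_paths = [  # illustrative paths
    "s3://my-bucket/a.parquet",
    "s3://my-bucket/b.parquet",
]

# error_missing_files is the *proposed* argument, not an existing
# awswrangler parameter; "ignore" / "warn" / "raise" are illustrative values.
df = wr.s3.read_parquet(
    path=list_of_paths,
    error_missing_files="ignore",
)
```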

Alternatively (or additionally), wr.s3.does_object_exist() could accept a list of files for path= and check that the objects exist in parallel.
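
A minimal sketch of that second idea, wrapping the existing single-path wr.s3.does_object_exist in a thread pool (the helper name and worker count are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

import awswrangler as wr

def filter_existing(paths, max_workers=32):
    """Keep only the S3 paths that currently exist (illustrative helper)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(wr.s3.does_object_exist, paths))
    return [p for p, ok in zip(paths, flags) if ok]

# df = wr.s3.read_parquet(path=filter_existing(list_of_paths))
```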

dhorkel avatar May 06 '22 15:05 dhorkel

Any reason why you are not simply passing an S3 prefix instead of a list of objects? With an S3 prefix argument (e.g. s3://my-bucket/my-prefix/), the method lists all objects first, ensuring only existing objects are consumed. With a list, it is assumed the user has already checked that the objects are there, so I am not sure it should be the library's responsibility to check.
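
For reference, the two calling conventions being contrasted (bucket, prefix, and file names are illustrative):

```python
import awswrangler as wr

# Prefix: the library lists the objects under the prefix first, so only
# files that actually exist are read.
df = wr.s3.read_parquet(path="s3://my-bucket/my-prefix/")

# Explicit list: every path is assumed to exist; a missing file fails the call.
df = wr.s3.read_parquet(path=[
    "s3://my-bucket/my-prefix/file1.parquet",
    "s3://my-bucket/my-prefix/file2.parquet",
])
```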

jaidisido avatar May 09 '22 09:05 jaidisido

Also note that since #1246, wrangler will throw a NoFilesFound exception on missing files instead of a 404.

kukushking avatar May 09 '22 09:05 kukushking

@jaidisido

In my use case, the paths aren't always under the same prefix. My paths look like s3://my-bucket/year/month/day/hour/<some_hash>.parquet. The hash is based on an id and a datetime, so I can generate what the hashes should be if the files exist, but a generic prefix won't cover it.

If I could re-engineer the data lake I would, it's outside my control.

Currently the fastest solution I have found is not using awswrangler at all: I call pandas read_parquet() in parallel, catch the failure when a path does not exist, and merge the results at the end.
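
A minimal sketch of that workaround (it assumes s3fs is installed so pandas can read s3:// paths directly; s3fs surfaces missing keys as FileNotFoundError):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_one(path):
    try:
        return pd.read_parquet(path)  # missing keys raise FileNotFoundError
    except FileNotFoundError:
        return None

def read_many(paths, max_workers=16):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frames = [f for f in pool.map(read_one, paths) if f is not None]
    return pd.concat(frames, ignore_index=True)
```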

dhorkel avatar May 19 '22 18:05 dhorkel

I am currently facing the same issue. Have you found anything faster than using pd.read_parquet in parallel, @dhorkel? Maybe downloading with asyncio, so as not to be limited by the number of cores?
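
One possible shape for that idea: since S3 reads are I/O-bound, threads already sidestep the core limit, and asyncio.to_thread (Python 3.9+) can drive the same blocking reads from an event loop; a fully async client such as aioboto3 would be the heavier alternative. A sketch, again assuming s3fs is installed:

```python
import asyncio

import pandas as pd

def _read_one(path):
    # Blocking read; with s3fs installed, pandas reads s3:// paths
    # directly and missing keys surface as FileNotFoundError.
    try:
        return pd.read_parquet(path)
    except FileNotFoundError:
        return None

async def read_all(paths):
    # to_thread offloads each blocking read to a worker thread; S3 reads
    # are I/O-bound, so throughput is not capped by the number of cores.
    frames = await asyncio.gather(
        *(asyncio.to_thread(_read_one, p) for p in paths)
    )
    return pd.concat([f for f in frames if f is not None], ignore_index=True)

# df = asyncio.run(read_all(list_of_paths))  # list_of_paths: your S3 paths
```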

netomenoci-monoceros avatar Jul 13 '22 15:07 netomenoci-monoceros

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] avatar Sep 11 '22 18:09 github-actions[bot]

+1 for a kwarg to ignore missing files to avoid having to write my own multi-threaded head object code
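
For anyone who does end up writing it, the boilerplate in question is roughly this (a sketch over raw boto3; helper names and the worker count are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")  # boto3 clients are safe to share across threads

def head_exists(path: str) -> bool:
    bucket, key = path[len("s3://"):].split("/", 1)
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:  # missing keys come back as a 404 ClientError
        return False

def existing(paths, max_workers=32):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [p for p, ok in zip(paths, pool.map(head_exists, paths)) if ok]
```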

rupertcw avatar Feb 22 '24 11:02 rupertcw