aws-sdk-pandas
wr.s3.read_parquet() errors if any files are missing
Is your idea related to a problem? Please describe.
Currently, when using wr.s3.read_parquet(path=list_of_paths) with a list of file paths, if any individual file does not exist the following error is raised:
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
As a workaround I've used wr.s3.does_object_exist(), but this can be slow when checking many files since it only accepts a single file path.
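For example, the workaround can be parallelized with a thread pool, since the check is I/O-bound (a sketch; filter_existing and the worker count are my own, not library API):

```python
from concurrent.futures import ThreadPoolExecutor

import awswrangler as wr

def filter_existing(paths, max_workers=32):
    """Return only the S3 paths that currently exist."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # does_object_exist is a blocking HEAD request, so threads overlap the I/O
        flags = list(pool.map(wr.s3.does_object_exist, paths))
    return [p for p, ok in zip(paths, flags) if ok]

existing = filter_existing(list_of_paths)
df = wr.s3.read_parquet(path=existing)
```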
Describe the solution you'd like
I would like wr.s3.read_parquet() to have an optional argument like error_missing_files= that lets the user choose whether to raise an error, emit a warning, or simply ignore missing files.
Alternatively (or additionally), wr.s3.does_object_exist() could accept a list of paths for path= and check whether the objects exist in parallel.
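For illustration, the proposed argument would look something like this (error_missing_files= does not exist in the current API; this call is purely hypothetical):

```python
import awswrangler as wr

# Hypothetical usage of the proposed kwarg from this issue;
# it is NOT part of the current wr.s3.read_parquet() signature.
df = wr.s3.read_parquet(path=list_of_paths, error_missing_files=False)
```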
Any reason why you are not simply passing an S3 prefix instead of a list of objects? With an S3 prefix argument (e.g. s3://my-bucket/my-prefix/) the method lists all objects, ensuring only existing objects are consumed. With a list, it's assumed the user has already checked that the objects are there, so I am not sure it should be the library's responsibility to check.
Also note that since #1246 wrangler will throw a NoFilesFound exception on missing files instead of a 404.
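With that behavior, a caller can already skip missing files by catching the exception per path — a minimal sketch, assuming the exception is importable from awswrangler.exceptions and accepting one read per file:

```python
import awswrangler as wr
import pandas as pd
from awswrangler.exceptions import NoFilesFound

frames = []
for path in list_of_paths:
    try:
        # one read per path, so a single missing file only skips that file
        frames.append(wr.s3.read_parquet(path=path))
    except NoFilesFound:
        pass  # file was deleted or never existed; ignore it

df = pd.concat(frames, ignore_index=True)
```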
@jaidisido
In my use case, the paths aren't always under the same prefix. My paths look like s3://my-bucket/year/month/day/hour/<some_hash>.parquet, where the hash is based on an id and a datetime, so I can generate what the hashes should be if the files exist, but a generic prefix won't cover them.
If I could re-engineer the data lake I would, but it's outside my control.
Currently the fastest solution I've found is not using awswrangler at all: calling pandas read_parquet() in parallel, catching the failure when a path doesn't exist, and merging at the end.
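Roughly what that looks like (a sketch; read_or_none is my naming, and the exact exception for a missing key depends on the filesystem backend — s3fs surfaces it as FileNotFoundError):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_or_none(path):
    try:
        # requires s3fs/fsspec so pandas can read s3:// paths directly
        return pd.read_parquet(path)
    except FileNotFoundError:
        return None  # path doesn't exist; drop it

with ThreadPoolExecutor(max_workers=32) as pool:
    frames = [df for df in pool.map(read_or_none, list_of_paths) if df is not None]

result = pd.concat(frames, ignore_index=True)
```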
I am currently facing the same issue. Have you found anything faster than using pd.read_parquet in parallel, @dhorkel? Maybe downloading with asyncio so as not to be limited by the number of cores?
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
+1 for a kwarg to ignore missing files, to avoid having to write my own multi-threaded HeadObject code.