S3: Relying on load to check file existance if is file/dir causes high error rate on S3
Hello,
Recently I started using cloudpath library to upload entire directories to S3, but suddenly I noticed a weird behavior that we got a high error rate at S3 (according to AWS monitors) with the following errors: botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist. botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
And after a deep investigation I have found that those errors came from cloudpath.py::upload_from method since the library checks for each file whether it is exists or not before uploading it (if self.exists() and self.is_dir())
def upload_from(
self, source: Union[str, os.PathLike], force_overwrite_to_cloud: bool = False
) -> "CloudPath":
"""Upload a file or directory to the cloud path."""
source = Path(source)
if source.is_dir():
for p in source.iterdir():
(self / p.name).upload_from(p, force_overwrite_to_cloud=force_overwrite_to_cloud)
return self
else:
**if self.exists() and self.is_dir():
dst = self / source.name**
else:
dst = self
dst._upload_file_to_cloud(source, force_overwrite_to_cloud=force_overwrite_to_cloud)
return dst
My question is: Why we need this check ? its redundant because self.is_dir() always returns False.
Why we need this check ? its redundant because self.is_dir() always returns False.
Am I understanding you correctly that you think the self.is_dir() isn't necessary when self.exists() is False? I think this specifically isn't actually a problem. Python's and operator will short-circuit, which means if self.exists() evaluates to False, it doesn't actually even execute self.is_dir(). Documentation here,
The check for self.is_dir() is needed in general (if self.exists() is True) to match the expected behavior when you try to move a file to a directory. If your source is a file and destination is a directory, most programs will put your file into that directory, instead of moving your file to where that directory is.
high error rate at S3 (according to AWS monitors) with the following errors: botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist. botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
~This sounds like it might be the issue to look into more closely. I don't think we expect to get NoSuchKey errors under normal use. Can you tell us more about what you're doing so we can try to reproduce this error?~ I misunderstood. These are expected errors and would get picked up by AWS though users shouldn't expect to see them in Python during runtime.
One additional potential change here to reduce network calls is that maybe we can leverage the force_overwrite_to_cloud flag to skip certain checks.
I don't think we expect to get NoSuchKey errors under normal use.
I think we do depend on these errors when doing existence and file/dir checks. Maybe there's a nicer way to have boto3 ask if a file exists, but I am not sure what it is. Even the boto3 waiter object polls head_object to check for existence (which is the same as the load method we use).
https://github.com/drivendataorg/cloudpathlib/blob/3901746362b5a489f47d7192f9aed50c284369d5/cloudpathlib/s3/s3client.py#L133-L154
Ok, now I understand why we need self.is_dir(), but still don't understand why we need self.exists (which is the main cause of the problem).
Please note that there are actually no folders in S3, just a flat file structure. S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, S3 supports the folder concept as a means of grouping objects.
For example, lets assume we have an empty bucket called 'my_bucket' and we want to upload file called x.json to s3://my_bucket/folder_1/folder_2
Then we just need to tell boto3 to upload x.json to s3://my_bucket/folder_1/folder_2/x.json and S3 will manage the creation of folder_1/folder_2, so we don't really need to check if they already exists.
So I believe that self.exists is redundant and we can remove it at least for AWS.
This is working as intended and is not redundant. Consider the case where I have this file exists s3://bucket/dir/file1.txt. I also have a local directory local_dir with the files other1.txt and other2.txt that I want to upload.
CloudPath("s3://bucket/dir/").upload_from("./local_dir")
This should result in the following files existing on s3.
s3://bucket/dir/file1.txt
s3://bucket/dir/local_dir/other1.txt
s3://bucket/dir/local_dir/other2.txt
This is because cloudpathlib provides a pathlib-like interface for cloud storage providers. The pathlib interface supports thinking about both directories and files. Because of this, our APIs let developers write code with both directories and files whether or not those are abstractions in the underlying storage provider.
Like I already mentioned, the cause of the errors is the method we use to check for existence on S3, not the check itself in the upload_from method.
We used to have an implementation that did not result in these kind of errors, but that implementation made it impossible to use cloudpathlib with buckets where the user did not have list permissions (see #168).
Ideally, we would have an existence check that (1) doesn't generate errors on S3, (2) doesn't depend on a try-except in our codebase, (3) doesn't require list permissions. Maybe that's possible with boto3, and if so we'd accept a PR.
I agree that the root of the problem came from the way you check files existence in S3.
But I still not convinced why we need self.exists() when we providing force_overwrite_to_cloud=True
(please think about the following change (force_overwrite_to_cloud || self.exists()) and self.is_dir() ).
First of all, I run cloudpathlib on your example and it behave differently.
The results were as follows:
s3://bucket/dir/file1.txt
s3://bucket/dir/other1.txt
s3://bucket/dir/other2.txt
and self.exists() still return False.
The only case I can think about in which self.exists() and self.is_dir() return True is the following:
CloudPath("s3://bucket/dir").upload_from("./local_dir/other1.txt")
and again, in such cases we don't need this check. Please let me know if you have other edge cases.