Clarify Error Logs for GCS Directory/Object Conflicts
Apache Iceberg version
0.9.0 (latest release)
Please describe the bug 🐞
Hi team :)
I had a problem writing an Iceberg table to GCS via the JDBC catalog, but I solved it. However, I want to fix the logging around it.
The error log said that the file already existed in the bucket — but there was actually nothing there. So, I printed the object to find out what type it was, and it turned out to be a directory.
Finally, I found the issue and was able to write the data successfully. How did I solve the problem? I just set my environment variable like below: export GOOGLE_APPLICATION_CREDENTIALS="" (I found it here: https://github.com/apache/arrow/blob/main/python/pyarrow/_gcsfs.pyx#L31)
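For anyone hitting the same thing, this is roughly how I printed the object type (a minimal sketch, not my exact code; the bucket and object name below are placeholders):

from pyarrow import fs

# Sketch: ask the GCS filesystem what it thinks lives at the metadata path.
# Bucket/object below are placeholders, not my real table location.
gcs = fs.GcsFileSystem()
info = gcs.get_file_info("my-bucket/warehouse/test/metadata/00000-example.metadata.json")
print(info)       # e.g. <FileInfo for '...': type=FileType.Directory>
print(info.type)  # in my case this reported FileType.Directory, even though nothing existed there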
However, the log was confusing to me. If the problem was a lack of permission to read the GCS bucket, the error log should have said something like, "Cannot find bucket — please check your permissions" instead.
So, I want to improve this to prevent others from wasting time like I did.
Do you have any ideas on how to fix or improve the logging?
Lastly, I found the get_file_info method here, but I couldn't dig deeper yet... https://github.com/apache/arrow/blob/main/python/pyarrow/fs.py#L268
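To make the ask concrete, the kind of check I have in mind looks roughly like this (a sketch only, not existing PyIceberg code; the helper name and message are hypothetical):

from pyarrow.fs import FileInfo, FileType

def check_create_target(info: FileInfo, location: str) -> None:
    # Hypothetical helper: turn a surprising Directory result into an
    # actionable error instead of "file already exists".
    if info.type == FileType.NotFound:
        return  # nothing at the location, safe to create the file
    if info.type == FileType.Directory:
        # Object stores have no real directories; this usually means the object
        # metadata could not be read (e.g. missing credentials/permissions)
        # or another object shares this prefix.
        raise PermissionError(
            f"Location is reported as a directory, not a file: {location}. "
            "Check GCS credentials and permissions before assuming it exists."
        )
    raise FileExistsError(f"Cannot create file, already exists: {location}")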
Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
Also, for an object storage engine it is weird that there is a directory type at all, since that file is not a directory.
I found where that file location information comes from!
https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L97
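For reference, a quick way to see that field from Python (a sketch; assumes gcsfs/fsspec are installed, and the path is a placeholder):

import fsspec

# Sketch: fsspec's info() returns a dict whose "type" key is where the
# directory/file label comes from (path below is a placeholder).
gcs = fsspec.filesystem("gs")
print(gcs.info("my-bucket/warehouse/test/metadata"))
# e.g. {'name': 'my-bucket/warehouse/test/metadata', 'size': 0, 'type': 'directory'}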
Thanks for raising this @HyunWooZZ! 🙌 Are you interested in providing a fix?
Thanks for replying, @Fokko! Yes :) I'm definitely interested in fixing the above. Yesterday, I found something related to the above log:
However, is it our responsibility to handle that log in this project? Based on the Arrow file system code, I believe our error handling is quite robust.
https://github.com/apache/arrow/blob/main/python/pyarrow/_fs.pyx
def get_file_info(self, paths_or_selector):
    """
    Get info for the given files.

    Any symlink is automatically dereferenced, recursively. A non-existing
    or unreachable file returns a FileStat object and has a FileType of
    value NotFound. An exception indicates a truly exceptional condition
    (low-level I/O error, etc.).

    Parameters
    ----------
    paths_or_selector : FileSelector, path-like or list of path-likes
        Either a selector object, a path-like object or a list of
        path-like objects. The selector's base directory will not be
        part of the results, even if it exists. If it doesn't exist,
        use `allow_not_found`.

    Returns
    -------
    FileInfo or list of FileInfo
        Single FileInfo object is returned for a single path, otherwise
        a list of FileInfo objects is returned.

    Examples
    --------
    >>> local
    <pyarrow._fs.LocalFileSystem object at ...>
    >>> local.get_file_info("/{}/pyarrow-fs-example.dat".format(local_path))
    <FileInfo for '/.../pyarrow-fs-example.dat': type=FileType.File, size=4>
    """
    cdef:
        CFileInfo info
        c_string path
        vector[CFileInfo] infos
        vector[c_string] paths
        CFileSelector selector

    if isinstance(paths_or_selector, FileSelector):
        with nogil:
            selector = (<FileSelector>paths_or_selector).selector
            infos = GetResultValue(self.fs.GetFileInfo(selector))
    elif isinstance(paths_or_selector, (list, tuple)):
        paths = [_path_as_bytes(s) for s in paths_or_selector]
        with nogil:
            infos = GetResultValue(self.fs.GetFileInfo(paths))
    elif isinstance(paths_or_selector, (bytes, str)):
        path = _path_as_bytes(paths_or_selector)
        with nogil:
            info = GetResultValue(self.fs.GetFileInfo(path))
        return FileInfo.wrap(info)
    else:
        raise TypeError('Must pass either path(s) or a FileSelector')

    return [FileInfo.wrap(info) for info in infos]
I added some PyIceberg code that retrieves file information from the Arrow file system to make it easier for you to find.
iceberg-python: pyiceberg/io/pyarrow.py
def _file_info(self) -> FileInfo:
    """Retrieve a pyarrow.fs.FileInfo object for the location.

    Raises:
        PermissionError: If the file at self.location cannot be accessed due to a permission error such as
            an AWS error code 15.
    """
    try:
        file_info = self._filesystem.get_file_info(self._path)
    except OSError as e:
        if e.errno == 13 or "AWS Error [code 15]" in str(e):
            raise PermissionError(f"Cannot get file info, access denied: {self.location}") from e
        raise  # pragma: no cover - If some other kind of OSError, raise the raw error

    if file_info.type == FileType.NotFound:
        raise FileNotFoundError(f"Cannot get file info, file not found: {self.location}")
    return file_info
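For context (my paraphrase, not a verbatim copy of the source): exists() is a thin wrapper around _file_info() that treats FileNotFoundError as "does not exist", roughly:

# Rough paraphrase of how exists() relates to _file_info(), for orientation only.
def exists(self) -> bool:
    try:
        self._file_info()  # raises FileNotFoundError when FileType is NotFound
        return True
    except FileNotFoundError:
        return False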
Hi @HyunWooZZ, thanks for raising this issue! I want to summarize my understanding.
I had a problem writing an Iceberg table to GCS via the JDBC catalog, but I solved it.
Looking at the screenshot, it looks like you're running create_table with a GCS table location. On table creation, the metadata JSON file write failed with FileExistsError.
The error log said that the file already existed in the bucket — but there was actually nothing there. So, I printed the object to find out what type it was, and it turned out to be a directory.
Do you know how this directory was created? It seems like this might be the real problem: GCS created a directory named gs://.../....metadata.json, and this conflicts with the metadata JSON write above.
How did I solve the problem? I just set my environment variable like below: export GOOGLE_APPLICATION_CREDENTIALS=""
This implies that there was a permission issue, and the additional credential resolved it. When you created the catalog, did you provide GCS credentials as described in the docs? https://py.iceberg.apache.org/configuration/#google-cloud-storage
Hi, @kevinjqliu :) Thank you for summarizing.
Looking at the screenshot. It looks like you're running create_table with a GCS table location. On table creation, the metadata json file write failed with FileExistsError.
You're right! Before creating a file like the metadata JSON, the create method checks whether the file already exists in the filesystem:
try:
    if not overwrite and self.exists() is True:
        raise FileExistsError(f"Cannot create file, already exists: {self.location}")
do you know how this directory was created? Seems like this might be the real problem. GCS created a directory named gs://.../....metadata.json and this conflicts with the metadata json write above.
No, there was nothing there. Also, as you know, in an object storage system there's no real concept of a directory. So I think the Arrow GCS filesystem just makes up that object information on the fly, because it cannot check GCS objects without credentials. This part is really weird to me...
this implies that there was a permission issue. And additional credential resolved it. when you created the catalog, did you provide gcs credentials like so https://py.iceberg.apache.org/configuration/#google-cloud-storage
I didn’t include the OAuth token. I just added the project ID and locations. If I had generated an OAuth token and injected it into the config kwargs, it probably would have worked. However, I think OAuth tokens are generally not suitable for server-side jobs like event-driven or scheduled jobs, because it would require generating and injecting the token every time before running the job.
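(For reference, the kind of configuration the linked docs describe looks roughly like this; the property names come from that page, and all the values below are placeholders, not what I actually ran.)

from pyiceberg.catalog import load_catalog

# Sketch based on the linked GCS configuration docs; all values are placeholders.
catalog = load_catalog(
    "default",
    **{
        "uri": "<catalog-uri>",
        "gcs.project-id": "<my-project>",
        "gcs.oauth2.token": "<oauth2-token>",
    },
)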
I think this might be the clue that’s worth digging into further: https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/gcsfs.cc
Result<FileInfo> GetFileInfo(const GcsPath& path) {
  if (path.object.empty()) {
    auto meta = client_.GetBucketMetadata(path.bucket);
    return GetFileInfoBucket(path, std::move(meta).status());
  }
  auto meta = client_.GetObjectMetadata(path.bucket, path.object);
  Result<FileInfo> info = GetFileInfoObject(path, meta);
  if (!info.ok() || info->type() != FileType::NotFound) {
    return info;
  }
  // Not found case. It could be this was written to GCS with a different
  // "Directory" convention, so if there is at least one object that
  // matches the prefix we assume it is a directory.
  std::string canonical = internal::EnsureTrailingSlash(path.object);
  auto list_result = client_.ListObjects(path.bucket, gcs::Prefix(canonical));
  if (list_result.begin() != list_result.end()) {
    // If there is at least one result it indicates this is a directory (at
    // least one object exists that starts with "path/")
    return FileInfo(path.full_path, FileType::Directory);
  }
  // Return the original not-found info if there was no match.
  return info;
}
The method above is used to retrieve file information in the FileSystem.get_file_info method: arrow/python/pyarrow/_fs.pyx
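To check whether that prefix-listing fallback is what we are hitting, something like this should show whether any object actually exists under the prefix (a sketch; it assumes the google-cloud-storage client and working credentials, and the bucket/prefix are placeholders):

from google.cloud import storage

# Sketch: list objects under the "directory" prefix the C++ fallback would match.
# If this prints nothing, the Directory result is spurious.
client = storage.Client()
prefix = "<table-location>/metadata/<metadata-file>.metadata.json/"  # note trailing slash
for blob in client.list_blobs("<bucket>", prefix=prefix):
    print(blob.name)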
I will test something, and after that I will share more details.
Interesting, so self.exists() here returns True, which means self._file_info() returned an object and did not error.
It'll be helpful to log that return value.
It should return with type FileType.NotFound and raise FileNotFoundError.
According to the docs
A non-existing or unreachable file returns a FileStat object and has a FileType of value NotFound.
the "unreachable file" part is interesting.
Also looking at the code, GCS is doing something with the not found case here
It'll be helpful for debugging to see what the return value of self._file_info() is
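Something along these lines would capture it (a sketch; construct the filesystem with whatever options your catalog uses, and the path is a placeholder for your metadata file):

from pyarrow import fs

# Sketch: reproduce the same lookup PyIceberg does and print the raw FileInfo.
gcs = fs.GcsFileSystem()  # placeholder; use the same options as your catalog
info = gcs.get_file_info("<bucket>/<table-location>/metadata/<file>.metadata.json")
print(type(gcs), info, info.type)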
Sure @kevinjqliu :)
Here is the image; at the top of the logs is the return type:
<pyarrow._gcsfs.GcsFileSystem object at > FileInfo for gs://seoul_data_v1/warehouse/test/metadata/00000-6f9631fe-18db-4743-915b-36d7562d4073.metadata.json : type=FileType.Directory
My VM is provisioned on GCP. Basically, it is set up with the gcloud account owner, so I can see inside the bucket.
My code is this:
catalog.create_table("analytics.test", schema=arrow_table.schema, location="gs://seoul_data_v1/warehouse/test")
My OS info
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Thanks for the debug log!
<pyarrow._gcsfs.GcsFileSystem object at > FileInfo for location : type=FileType.Directory
is definitely the issue. We check whether the FileType is NotFound to raise FileNotFoundError.
We're hitting this line, https://github.com/apache/arrow/blob/067fd2a2c6e54d33b9ae8a3324f59bebe960d485/cpp/src/arrow/filesystem/gcsfs.cc#L359
What is the path.full_path? The debug log didn't show the location.
@kevinjqliu location = 'seoul_data_v1/warehouse/test/metadata/00000-6f9631fe-18db-4743-915b-36d7562d4073.metadata.json'
Hmm, this is saying that the location is a directory...
Can you do something like ls on gs://seoul_data_v1/warehouse/test/metadata/?
In particular, are there any objects under seoul_data_v1/warehouse/test/metadata/00000-6f9631fe-18db-4743-915b-36d7562d4073.metadata.json/? Note the / at the very end.
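If it's easier from Python, something like this would list everything under that prefix (a sketch; use the same filesystem options as your catalog):

from pyarrow import fs

# Sketch: recursively list the metadata "directory" to see what, if anything,
# actually exists under the prefix.
gcs = fs.GcsFileSystem()
selector = fs.FileSelector(
    "seoul_data_v1/warehouse/test/metadata/", recursive=True, allow_not_found=True
)
for info in gcs.get_file_info(selector):
    print(info)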
@kevinjqliu There is nothing. I ran the gsutil ls command beforehand, and the seoul_data_v1/warehouse/test/ directory did not exist.
As you know, 00000-6f9631fe-18db-4743-915b-36d7562d4073.metadata.json is a unique ID generated by PyIceberg.
So, I reported the issue here https://github.com/apache/arrow/issues/46414
Once PyArrow 21.0.0 is released, we will be able to test the above case. Until then, I will wait.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'