
Clarify Error Logs for GCS Directory/Object Conflicts

HyunWooZZ opened this issue 7 months ago

Apache Iceberg version

0.9.0 (latest release)

Please describe the bug 🐞

Hi team :)

I ran into a problem writing an Iceberg table to GCS via the JDBC catalog. I eventually solved it, but I'd like to improve the logging around it.

The error log said that the file already existed in the bucket — but there was actually nothing there. So, I printed the object to find out what type it was, and it turned out to be a directory.

[screenshot]

Finally, I found the issue and was able to write the data successfully. How did I solve the problem? I just set my environment variable like below: export GOOGLE_APPLICATION_CREDENTIALS="" (I found this here: https://github.com/apache/arrow/blob/main/python/pyarrow/_gcsfs.pyx#L31)
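
In Python, the same workaround looks like this (a sketch of what worked in my environment, not an official recommendation):

    import os

    # Blank out the credentials path so pyarrow's GcsFileSystem stops picking
    # up the broken default credentials (see the _gcsfs.pyx link above).
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = ""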

However, the log was confusing to me. If the problem was a lack of permission to read the GCS bucket, the error log should have said something like, "Cannot find bucket — please check your permissions" instead.

So, I want to improve this to prevent others from wasting time like I did.

Do you have any ideas on how to fix or improve the logging?

Lastly, I found the get_file_info method here, but I couldn't dig deeper yet... https://github.com/apache/arrow/blob/main/python/pyarrow/fs.py#L268

Willingness to contribute

  • [ ] I can contribute a fix for this bug independently
  • [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • [ ] I cannot contribute a fix for this bug at this time

HyunWooZZ avatar Apr 28 '25 13:04 HyunWooZZ

Also, for an object storage system it is weird that there is a directory type at all, and that this file is reported as a directory when it is not one.

HyunWooZZ avatar Apr 28 '25 22:04 HyunWooZZ

I found where that file location information comes from!

https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L97

[screenshots]

HyunWooZZ avatar Apr 28 '25 23:04 HyunWooZZ

Thanks for raising this @HyunWooZZ! 🙌 Are you interested in providing a fix?

Fokko avatar May 02 '25 11:05 Fokko

Thanks for replying, @Fokko! Yes :) I'm definitely interested in fixing the above. Yesterday, I found something related to the above log:

However, is it this project's responsibility to handle that log? Based on the Arrow filesystem code, I believe our error handling is already quite robust.

https://github.com/apache/arrow/blob/main/python/pyarrow/_fs.pyx

  def get_file_info(self, paths_or_selector):
        """
        Get info for the given files.

        Any symlink is automatically dereferenced, recursively. A non-existing
        or unreachable file returns a FileStat object and has a FileType of
        value NotFound. An exception indicates a truly exceptional condition
        (low-level I/O error, etc.).

        Parameters
        ----------
        paths_or_selector : FileSelector, path-like or list of path-likes
            Either a selector object, a path-like object or a list of
            path-like objects. The selector's base directory will not be
            part of the results, even if it exists. If it doesn't exist,
            use `allow_not_found`.

        Returns
        -------
        FileInfo or list of FileInfo
            Single FileInfo object is returned for a single path, otherwise
            a list of FileInfo objects is returned.

        Examples
        --------
        >>> local
        <pyarrow._fs.LocalFileSystem object at ...>
        >>> local.get_file_info("/{}/pyarrow-fs-example.dat".format(local_path))
        <FileInfo for '/.../pyarrow-fs-example.dat': type=FileType.File, size=4>
        """
        cdef:
            CFileInfo info
            c_string path
            vector[CFileInfo] infos
            vector[c_string] paths
            CFileSelector selector

        if isinstance(paths_or_selector, FileSelector):
            with nogil:
                selector = (<FileSelector>paths_or_selector).selector
                infos = GetResultValue(self.fs.GetFileInfo(selector))
        elif isinstance(paths_or_selector, (list, tuple)):
            paths = [_path_as_bytes(s) for s in paths_or_selector]
            with nogil:
                infos = GetResultValue(self.fs.GetFileInfo(paths))
        elif isinstance(paths_or_selector, (bytes, str)):
            path = _path_as_bytes(paths_or_selector)
            with nogil:
                info = GetResultValue(self.fs.GetFileInfo(path))
            return FileInfo.wrap(info)
        else:
            raise TypeError('Must pass either path(s) or a FileSelector')

        return [FileInfo.wrap(info) for info in infos]

HyunWooZZ avatar May 02 '25 14:05 HyunWooZZ

I added some PyIceberg code that retrieves file information from the Arrow file system to make it easier for you to find.

pyiceberg/io/pyarrow.py:

    def _file_info(self) -> FileInfo:
        """Retrieve a pyarrow.fs.FileInfo object for the location.

        Raises:
            PermissionError: If the file at self.location cannot be accessed due to a permission error such as
                an AWS error code 15.
        """
        try:
            file_info = self._filesystem.get_file_info(self._path)
        except OSError as e:
            if e.errno == 13 or "AWS Error [code 15]" in str(e):
                raise PermissionError(f"Cannot get file info, access denied: {self.location}") from e
            raise  # pragma: no cover - If some other kind of OSError, raise the raw error

        if file_info.type == FileType.NotFound:
            raise FileNotFoundError(f"Cannot get file info, file not found: {self.location}")
        return file_info
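
As a starting point, here is a minimal sketch of the clearer error mapping I have in mind (a hypothetical helper, not merged code):

    from pyarrow.fs import FileInfo, FileType

    def check_file_info(file_info: FileInfo, location: str) -> FileInfo:
        # Hypothetical helper: distinguish a missing object from a directory hit,
        # instead of letting a later exists() check misreport "file already exists".
        if file_info.type == FileType.NotFound:
            raise FileNotFoundError(f"Cannot get file info, file not found: {location}")
        if file_info.type == FileType.Directory:
            # A Directory here can mean a same-named prefix in the bucket or, as in
            # this issue, a credential problem that made Arrow's prefix-listing
            # fallback misfire.
            raise IsADirectoryError(
                f"Expected an object but found a directory: {location}. "
                "Check for a conflicting prefix or verify GCS credentials."
            )
        return file_info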


HyunWooZZ avatar May 02 '25 15:05 HyunWooZZ

Hi @HyunWooZZ, thanks for raising this issue! Let me summarize my understanding.

I had a problem writing an Iceberg table to GCS via the JDBC catalog, but I solved it.

Looking at the screenshot, it looks like you're running create_table with a GCS table location. On table creation, the metadata JSON file write failed with FileExistsError.

The error log said that the file already existed in the bucket — but there was actually nothing there. So, I printed the object to find out what type it was, and it turned out to be a directory.

Do you know how this directory was created? It seems like this might be the real problem: GCS created a directory named gs://.../....metadata.json and this conflicts with the metadata JSON write above.

How did I solve the problem? I just set my environment variable like below: export GOOGLE_APPLICATION_CREDENTIALS=""

This implies that there was a permission issue, and the additional credential resolved it. When you created the catalog, did you provide GCS credentials as described here: https://py.iceberg.apache.org/configuration/#google-cloud-storage?
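
For example, something along these lines (a sketch; the catalog URI and token values are placeholders):

    from pyiceberg.catalog import load_catalog

    # Pass GCS credentials as catalog properties, per the configuration page above.
    catalog = load_catalog(
        "default",
        **{
            "uri": "postgresql://user:pass@host/db",  # placeholder catalog URI
            "gcs.project-id": "my-gcp-project",
            "gcs.oauth2.token": "ya29.placeholder",
        },
    )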

kevinjqliu avatar May 04 '25 16:05 kevinjqliu

Hi, @kevinjqliu :) Thank you for summarizing.

Looking at the screenshot, it looks like you're running create_table with a GCS table location. On table creation, the metadata JSON file write failed with FileExistsError.

You're right! Before creating a file such as the metadata file, the create method checks whether the file already exists in the filesystem:

        try:
            if not overwrite and self.exists() is True:
                raise FileExistsError(f"Cannot create file, already exists: {self.location}")
            ...  # rest of create() elided

Do you know how this directory was created? It seems like this might be the real problem: GCS created a directory named gs://.../....metadata.json and this conflicts with the metadata JSON write above.

No, there was nothing there. Also, as you know, in an object storage system there's no real concept of a directory. So I think the Arrow GCS filesystem improvises the object information because it cannot actually verify GCS objects without credentials. This part is really weird to me...

This implies that there was a permission issue, and the additional credential resolved it. When you created the catalog, did you provide GCS credentials as described here: https://py.iceberg.apache.org/configuration/#google-cloud-storage?

I didn't include the OAuth token; I just added the project ID and location. If I had generated an OAuth token and injected it into the config kwargs, it probably would have worked. However, I think OAuth tokens are generally not suitable for server-side jobs like event-driven or scheduled jobs, because the token would have to be generated and injected every time before running the job.

HyunWooZZ avatar May 05 '25 13:05 HyunWooZZ

I think this might be the clue that’s worth digging into further: https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/gcsfs.cc

  Result<FileInfo> GetFileInfo(const GcsPath& path) {
    if (path.object.empty()) {
      auto meta = client_.GetBucketMetadata(path.bucket);
      return GetFileInfoBucket(path, std::move(meta).status());
    }
    auto meta = client_.GetObjectMetadata(path.bucket, path.object);
    Result<FileInfo> info = GetFileInfoObject(path, meta);
    if (!info.ok() || info->type() != FileType::NotFound) {
      return info;
    }
    // Not found case.  It could be this was written to GCS with a different
    // "Directory" convention, so if there is at least one object that
    // matches the prefix we assume it is a directory.
    std::string canonical = internal::EnsureTrailingSlash(path.object);
    auto list_result = client_.ListObjects(path.bucket, gcs::Prefix(canonical));
    if (list_result.begin() != list_result.end()) {
      // If there is at least one result it indicates this is a directory (at
      // least one object exists that starts with "path/")
      return FileInfo(path.full_path, FileType::Directory);
    }
    // Return the original not-found info if there was no match.
    return info;
  }

The method above is what backs FileSystem.get_file_info in python/pyarrow/_fs.pyx.
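
A quick way to poke at this from Python (a repro sketch; the object name is a placeholder and results depend on the credential state):

    from pyarrow import fs

    # When the object metadata lookup fails, Arrow falls through to the
    # ListObjects prefix check shown above, which is where the spurious
    # FileType.Directory came from in my environment.
    gcs = fs.GcsFileSystem()  # default credentials; behavior depends on their state
    info = gcs.get_file_info("seoul_data_v1/warehouse/test/metadata/placeholder.metadata.json")
    print(info.type)  # observed as FileType.Directory instead of FileType.NotFound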

I will test a few things and share more details after that.

HyunWooZZ avatar May 05 '25 14:05 HyunWooZZ

Interesting, so self.exists() here returns True, which means self._file_info() returned an object and did not error. It'll be helpful to log that return value.

It should return a result with type FileType.NotFound and raise FileNotFoundError.

According to the docs

A non-existing or unreachable file returns a FileStat object and has a FileType of value NotFound.

the "unreachable file" part is interesting.

Also, looking at the code, GCS does something special with the not-found case here.

It'll be helpful for debugging to see what the return value of self._file_info() is
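
For example (a hypothetical probe; the object name is a placeholder):

    from pyiceberg.io.pyarrow import PyArrowFileIO

    # Open the same location directly and inspect the pyarrow FileInfo
    # (_file_info() is private; used here only for debugging).
    f = PyArrowFileIO().new_input("gs://seoul_data_v1/warehouse/test/metadata/placeholder.metadata.json")
    print(f._file_info())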

kevinjqliu avatar May 05 '25 18:05 kevinjqliu

Sure @kevinjqliu :)

In the image below, the top of the log shows the return value:

    <pyarrow._gcsfs.GcsFileSystem object at > FileInfo for gs://seoul_data_v1/warehouse/test/metadata/00000-6f9631fe-18db-4743-915b-36d7562d4073.metadata.json : type=FileType.Directory

[screenshot]

HyunWooZZ avatar May 06 '25 00:05 HyunWooZZ

My VM is provisioned on GCP, and the gcloud account on it has the owner role, so I can view the bucket contents. [screenshot]

My code is this:

    catalog.create_table("analytics.test", schema=arrow_table.schema, location="gs://seoul_data_v1/warehouse/test")

My OS info

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

HyunWooZZ avatar May 06 '25 05:05 HyunWooZZ

Thanks for the debug log!

<pyarrow._gcsfs.GcsFileSystem object at > FileInfo for location : type=FileType.Directory

is definitely the issue. We check whether the FileType is NotFound before raising FileNotFoundError.

We're hitting this line: https://github.com/apache/arrow/blob/067fd2a2c6e54d33b9ae8a3324f59bebe960d485/cpp/src/arrow/filesystem/gcsfs.cc#L359

What is path.full_path? The debug log didn't show the location.

kevinjqliu avatar May 06 '25 15:05 kevinjqliu

@kevinjqliu location = 'seoul_data_v1/warehouse/test/metadata/00000-6f9631fe-18db-4743-915b-36d7562d4073.metadata.json'

HyunWooZZ avatar May 06 '25 21:05 HyunWooZZ

Hmmm, this is saying that the location is a directory...

Can you do something like ls on gs://seoul_data_v1/warehouse/test/metadata/? In particular, are there any objects under seoul_data_v1/warehouse/test/metadata/00000-6f9631fe-18db-4743-915b-36d7562d4073.metadata.json/? Note the / at the very end.
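
For example, with pyarrow (a sketch using the standard FileSelector listing API):

    from pyarrow import fs

    # List everything under the suspect prefix; note the trailing slash.
    gcs = fs.GcsFileSystem()
    selector = fs.FileSelector(
        "seoul_data_v1/warehouse/test/metadata/", recursive=True, allow_not_found=True
    )
    for info in gcs.get_file_info(selector):
        print(info.path, info.type)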

kevinjqliu avatar May 09 '25 20:05 kevinjqliu

@kevinjqliu There is nothing. I ran the gsutil ls command beforehand, and the seoul_data_v1/warehouse/test/ directory did not exist.

As you know, 00000-6f9631fe-18db-4743-915b-36d7562d4073.metadata.json is a unique name generated by PyIceberg.

So, I reported the issue here https://github.com/apache/arrow/issues/46414

HyunWooZZ avatar May 12 '25 23:05 HyunWooZZ

Once PyArrow 21.0.0 is released, we will be able to test the above case. Until then, I will wait.

HyunWooZZ avatar May 19 '25 03:05 HyunWooZZ

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Nov 18 '25 00:11 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Dec 10 '25 00:12 github-actions[bot]