[Bug]: Timedata Type Metadata Failing on GFS (SimpleDirectoryReader)

Open TheMellyBee opened this issue 8 months ago • 5 comments

Bug Description

Error when using SimpleDirectoryReader with a GCS filesystem (GCSFileSystem). This previously worked in 0.10.58 (I know, a while ago). We recently updated to 0.12.17.

Judging by the changelog and the error type, this error may be linked to https://github.com/run-llama/llama_index/pull/17724, which fixed https://github.com/run-llama/llama_index/issues/17715.

Version

0.12.17

Steps to Reproduce

  1. Create the GCS filesystem from a .json service account key
import json

from gcsfs import GCSFileSystem

with open("key.json") as f:
    service_account_info = json.load(f)
print("Our key.json service account:", service_account_info['client_email'])

# Create a fresh GCSFileSystem with the complete service account info
fs = GCSFileSystem(
    project=service_account_info['project_id'],
    token={
        'type': 'service_account',
        'client_email': service_account_info['client_email'],
        'private_key': service_account_info['private_key'],
        'private_key_id': service_account_info['private_key_id'],
        'project_id': service_account_info['project_id'],
        'token_uri': service_account_info['token_uri']
    }
)
  2. Try to load the files with the reader. Each entry in files is the gcs_uri with the "gs://" prefix removed (file names redacted for privacy).
reader = SimpleDirectoryReader(
    input_files=files,
    fs=fs
)

Relevant Logs/Tracebacks

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[55], line 1
----> 1 reader.load_data()

File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:722, in SimpleDirectoryReader.load_data(self, show_progress, num_workers, fs)
    717         files_to_process = tqdm(
    718             self.input_files, desc="Loading files", unit="file"
    719         )
    720     for input_file in files_to_process:
    721         documents.extend(
--> 722             SimpleDirectoryReader.load_file(
    723                 input_file=input_file,
    724                 file_metadata=self.file_metadata,
    725                 file_extractor=self.file_extractor,
    726                 filename_as_id=self.filename_as_id,
    727                 encoding=self.encoding,
    728                 errors=self.errors,
    729                 raise_on_error=self.raise_on_error,
    730                 fs=fs,
    731             )
    732         )
    734 return self._exclude_metadata(documents)

File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:546, in SimpleDirectoryReader.load_file(input_file, file_metadata, file_extractor, filename_as_id, encoding, errors, raise_on_error, fs)
    543 documents: list[Document] = []
    545 if file_metadata is not None:
--> 546     metadata = file_metadata(str(input_file))
    548 file_suffix = input_file.suffix.lower()
    549 if file_suffix in default_file_reader_suffix or file_suffix in file_extractor:
    550     # use file readers

File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:178, in _DefaultFileMetadataFunc.__call__(self, file_path)
    177 def __call__(self, file_path: str) -> dict:
--> 178     return default_file_metadata_func(file_path, self.fs)

File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:148, in default_file_metadata_func(file_path, fs)
    145     file_name = os.path.basename(file_path)
    147 creation_date = _format_file_timestamp(stat_result.get("created"))
--> 148 last_modified_date = _format_file_timestamp(stat_result.get("mtime"))
    149 last_accessed_date = _format_file_timestamp(stat_result.get("atime"))
    150 default_meta = {
    151     "file_path": file_path,
    152     "file_name": file_name,
   (...)
    157     "last_accessed_date": last_accessed_date,
    158 }

File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:124, in _format_file_timestamp(timestamp, include_time)
    121 if timestamp is None:
    122     return None
--> 124 timestamp_dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    125 if include_time:
    126     return timestamp_dt.strftime("%Y-%m-%dT%H:%M:%SZ")

TypeError: an integer is required (got type datetime.datetime)
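
The failing call can be reproduced without GCS access. The sketch below assumes, as the traceback suggests, that the filesystem's stat result carries datetime objects rather than epoch floats; the stat_result dictionary is a hypothetical stand-in, not real output from GCSFileSystem.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for what fs.stat() appears to return on GCS:
# a datetime object where an epoch timestamp (float) is expected.
stat_result = {"mtime": datetime(2025, 2, 24, 21, 2, tzinfo=timezone.utc)}

try:
    # Same call as in _format_file_timestamp; fromtimestamp() needs a number.
    datetime.fromtimestamp(stat_result["mtime"], tz=timezone.utc)
except TypeError as exc:
    error_message = str(exc)
    print(f"TypeError: {error_message}")
```

The exact wording of the message depends on the Python version, which would explain why the 3.9 and 3.10 tracebacks in this thread differ.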

TheMellyBee avatar Feb 24 '25 21:02 TheMellyBee

The TypeError you're encountering is due to a mismatch in the expected input type for the _format_file_timestamp function. This function expects a float representing a timestamp, but it seems to be receiving a datetime.datetime object instead. This issue likely stems from changes made in pull request #17724, which addressed timezone handling in metadata but may have inadvertently introduced this regression.

To resolve this issue, you should ensure that the timestamps being passed to _format_file_timestamp are in the correct format (i.e., as float timestamps). You might need to adjust the code where the stat_result is being processed to convert any datetime.datetime objects to timestamps using the timestamp() method. Here's a potential fix:

# Convert datetime to timestamp if necessary
creation_date = _format_file_timestamp(stat_result.get("created").timestamp() if isinstance(stat_result.get("created"), datetime) else stat_result.get("created"))
last_modified_date = _format_file_timestamp(stat_result.get("mtime").timestamp() if isinstance(stat_result.get("mtime"), datetime) else stat_result.get("mtime"))
last_accessed_date = _format_file_timestamp(stat_result.get("atime").timestamp() if isinstance(stat_result.get("atime"), datetime) else stat_result.get("atime"))

This adjustment should help ensure that the correct data type is passed to _format_file_timestamp, preventing the TypeError.
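
An alternative is to make the formatter itself tolerant of both input types. The following is a minimal sketch, not the library's actual implementation: the function name, the include_time default, the date-only format in the final branch, and the choice to treat naive datetimes as UTC are all assumptions.

```python
from datetime import datetime, timezone
from typing import Optional, Union


def format_file_timestamp(
    timestamp: Union[float, datetime, None], include_time: bool = False
) -> Optional[str]:
    """Format an epoch timestamp or a datetime as a UTC string."""
    if timestamp is None:
        return None

    if isinstance(timestamp, datetime):
        if timestamp.tzinfo is None:
            # Assumption: a naive datetime is already expressed in UTC.
            timestamp_dt = timestamp.replace(tzinfo=timezone.utc)
        else:
            timestamp_dt = timestamp.astimezone(timezone.utc)
    else:
        # Epoch seconds, as returned by local filesystems.
        timestamp_dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)

    if include_time:
        return timestamp_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Assumption: the date-only format used when include_time is False.
    return timestamp_dt.strftime("%Y-%m-%d")


print(format_file_timestamp(0, include_time=True))
print(
    format_file_timestamp(
        datetime(2025, 2, 24, 21, 2, tzinfo=timezone.utc), include_time=True
    )
)
```

Handling the type check in one place keeps the three call sites for "created", "mtime" and "atime" unchanged.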

dosubot[bot] avatar Feb 24 '25 21:02 dosubot[bot]

We did have a patch around this, but it would be nice if this worked as expected for GCS.

TheMellyBee avatar Feb 26 '25 13:02 TheMellyBee

from datetime import datetime

from gcsfs import GCSFileSystem


def custom_file_metadata(file_path):
    fs = GCSFileSystem()
    stat_result = fs.stat(file_path)
    metadata = {
        "file_path": file_path,
        "file_name": file_path.split("/")[-1],
    }
    for key in ["created", "mtime", "atime"]:
        value = stat_result.get(key)
        if isinstance(value, datetime):
            metadata[f"{key}_date"] = value.strftime("%Y-%m-%dT%H:%M:%SZ")
            metadata[key] = int(value.timestamp())
        else:
            metadata[key] = value
    return metadata 
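
A variation on the workaround above, sketched under assumptions: instead of constructing a new GCSFileSystem on every call, the metadata function closes over a stat callable supplied by the caller. The names make_file_metadata and fake_stat are hypothetical, and fake_stat only imitates the datetime-valued stat result reported in this issue.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def make_file_metadata(
    stat: Callable[[str], Dict[str, Any]]
) -> Callable[[str], Dict[str, Any]]:
    """Build a metadata function around an existing stat callable."""

    def file_metadata(file_path: str) -> Dict[str, Any]:
        stat_result = stat(file_path)
        metadata: Dict[str, Any] = {
            "file_path": file_path,
            "file_name": file_path.split("/")[-1],
        }
        for key in ("created", "mtime", "atime"):
            value = stat_result.get(key)
            if isinstance(value, datetime):
                # Keep a readable string and an integer epoch value.
                metadata[f"{key}_date"] = value.strftime("%Y-%m-%dT%H:%M:%SZ")
                metadata[key] = int(value.timestamp())
            else:
                metadata[key] = value
        return metadata

    return file_metadata


def fake_stat(path: str) -> Dict[str, Any]:
    # Hypothetical stand-in for fs.stat on a GCS filesystem.
    return {"mtime": datetime(2025, 2, 26, 15, 2, tzinfo=timezone.utc)}


meta = make_file_metadata(fake_stat)("bucket/docs/report.pdf")
print(meta)
```

With a real filesystem, the result of make_file_metadata(fs.stat) could presumably be passed as the file_metadata argument of SimpleDirectoryReader, which the traceback shows being called with the file path as a string.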

TheMellyBee avatar Feb 26 '25 15:02 TheMellyBee

The same error occurs with GCSReader when loading data / resource.

ERROR:llama_index.readers.gcs.base:Error loading resource from GCS: 'datetime.datetime' object cannot be interpreted as an integer

TypeError                                 Traceback (most recent call last)
in <cell line: 2>()
      1 obj = reader.list_resources()[0]
----> 2 reader.load_resource(obj)

5 frames
/usr/local/lib/python3.10/dist-packages/llama_index/core/readers/file/base.py in _format_file_timestamp(timestamp, include_time)
    122     return None
    123
--> 124 timestamp_dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    125 if include_time:
    126     return timestamp_dt.strftime("%Y-%m-%dT%H:%M:%SZ")

TypeError: 'datetime.datetime' object cannot be interpreted as an integer

vladipako avatar Mar 03 '25 00:03 vladipako

Looks like replicating it is as simple as:

GCSReader(bucket="your-bucket", prefix="your-prefix").load_data()

This is a really big fail - any news on patching this up?

jareks avatar Mar 03 '25 13:03 jareks

All of you have the ability to contribute a PR if it's urgent, fyi :)

I don't have access to GCS to test a fix, but the fix itself seems pretty simple. I'll take a stab at it

logan-markewich avatar Mar 07 '25 17:03 logan-markewich

Yeah, sorry @logan-markewich, I totally would if I could get a free moment :( Plus, I admit the patch is working fine for us, but it would help others if the fix were merged.

TheMellyBee avatar Mar 17 '25 16:03 TheMellyBee