llama_index
[Bug]: Timedata Type Metadata Failing on GFS (SimpleDirectoryReader)
Bug Description
SimpleDirectoryReader raises a TypeError when used with GFS (a GCSFileSystem). This previously worked in 0.10.58 (I know, a while ago). We recently updated to 0.12.17.
Judging by the changelog and the error type, this error may be linked to https://github.com/run-llama/llama_index/pull/17724, which was fixing https://github.com/run-llama/llama_index/issues/17715.
Version
0.12.17
Steps to Reproduce
- Get the GFS from a .json service account
import json

from gcsfs import GCSFileSystem

with open("key.json") as f:
    service_account_info = json.load(f)
print("Our key.json service account:", service_account_info['client_email'])
# Create a fresh GCSFileSystem with the complete service account info
fs = GCSFileSystem(
    project=service_account_info['project_id'],
    token={
        'type': 'service_account',
        'client_email': service_account_info['client_email'],
        'private_key': service_account_info['private_key'],
        'private_key_id': service_account_info['private_key_id'],
        'project_id': service_account_info['project_id'],
        'token_uri': service_account_info['token_uri'],
    },
)
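- Build the reader from the file paths. Each path is the gcs_uri with the "gs://" prefix removed. The file list is redacted for privacy.
from llama_index.core import SimpleDirectoryReader

# files: list of bucket paths without the "gs://" prefix (redacted)
reader = SimpleDirectoryReader(
    input_files=files,
    fs=fs,
)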
Relevant Logs/Tracebacks
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[55], line 1
----> 1 reader.load_data()
File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:722, in SimpleDirectoryReader.load_data(self, show_progress, num_workers, fs)
717 files_to_process = tqdm(
718 self.input_files, desc="Loading files", unit="file"
719 )
720 for input_file in files_to_process:
721 documents.extend(
--> 722 SimpleDirectoryReader.load_file(
723 input_file=input_file,
724 file_metadata=self.file_metadata,
725 file_extractor=self.file_extractor,
726 filename_as_id=self.filename_as_id,
727 encoding=self.encoding,
728 errors=self.errors,
729 raise_on_error=self.raise_on_error,
730 fs=fs,
731 )
732 )
734 return self._exclude_metadata(documents)
File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:546, in SimpleDirectoryReader.load_file(input_file, file_metadata, file_extractor, filename_as_id, encoding, errors, raise_on_error, fs)
543 documents: list[Document] = []
545 if file_metadata is not None:
--> 546 metadata = file_metadata(str(input_file))
548 file_suffix = input_file.suffix.lower()
549 if file_suffix in default_file_reader_suffix or file_suffix in file_extractor:
550 # use file readers
File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:178, in _DefaultFileMetadataFunc.__call__(self, file_path)
177 def __call__(self, file_path: str) -> dict:
--> 178 return default_file_metadata_func(file_path, self.fs)
File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:148, in default_file_metadata_func(file_path, fs)
145 file_name = os.path.basename(file_path)
147 creation_date = _format_file_timestamp(stat_result.get("created"))
--> 148 last_modified_date = _format_file_timestamp(stat_result.get("mtime"))
149 last_accessed_date = _format_file_timestamp(stat_result.get("atime"))
150 default_meta = {
151 "file_path": file_path,
152 "file_name": file_name,
(...)
157 "last_accessed_date": last_accessed_date,
158 }
File /opt/homebrew/Caskroom/miniforge/base/envs/gen66-updates/lib/python3.9/site-packages/llama_index/core/readers/file/base.py:124, in _format_file_timestamp(timestamp, include_time)
121 if timestamp is None:
122 return None
--> 124 timestamp_dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
125 if include_time:
126 return timestamp_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
TypeError: an integer is required (got type datetime.datetime)
The TypeError you're encountering is due to a mismatch in the expected input type for the _format_file_timestamp function. This function expects a float representing a timestamp, but it seems to be receiving a datetime.datetime object instead. This issue likely stems from changes made in pull request #17724, which addressed timezone handling in metadata but may have inadvertently introduced this regression.
To resolve this issue, you should ensure that the timestamps being passed to _format_file_timestamp are in the correct format (i.e., as float timestamps). You might need to adjust the code where the stat_result is being processed to convert any datetime.datetime objects to timestamps using the timestamp() method. Here's a potential fix:
# Convert datetime to timestamp if necessary
creation_date = _format_file_timestamp(stat_result.get("created").timestamp() if isinstance(stat_result.get("created"), datetime) else stat_result.get("created"))
last_modified_date = _format_file_timestamp(stat_result.get("mtime").timestamp() if isinstance(stat_result.get("mtime"), datetime) else stat_result.get("mtime"))
last_accessed_date = _format_file_timestamp(stat_result.get("atime").timestamp() if isinstance(stat_result.get("atime"), datetime) else stat_result.get("atime"))
This adjustment should help ensure that the correct data type is passed to _format_file_timestamp, preventing the TypeError.
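The same idea can also be applied inside the formatter instead of at each call site. The following is only a sketch: `format_file_timestamp_tolerant` is a hypothetical name, not the library's function. Its None check and `include_time` branch follow the lines visible in the traceback above, while the date-only format and the treatment of naive datetimes as UTC are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional, Union


def format_file_timestamp_tolerant(
    timestamp: Union[float, datetime, None], include_time: bool = False
) -> Optional[str]:
    """Format a POSIX timestamp or a datetime as a UTC string (hypothetical helper)."""
    if timestamp is None:
        return None

    if isinstance(timestamp, datetime):
        # Some filesystems hand back datetime objects instead of floats.
        if timestamp.tzinfo is None:
            # Assumption: a naive datetime is already expressed in UTC.
            timestamp_dt = timestamp.replace(tzinfo=timezone.utc)
        else:
            timestamp_dt = timestamp.astimezone(timezone.utc)
    else:
        # Numeric POSIX timestamp, as local filesystems typically report.
        timestamp_dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)

    if include_time:
        return timestamp_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Assumption: the date-only output format is not shown in the traceback.
    return timestamp_dt.strftime("%Y-%m-%d")
```

Handling both types in one place keeps the three call sites for "created", "mtime" and "atime" unchanged.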
We did have a patch around this, but it would be nice if this worked as expected for GFS
from datetime import datetime

from gcsfs import GCSFileSystem


def custom_file_metadata(file_path):
    fs = GCSFileSystem()
    stat_result = fs.stat(file_path)
    metadata = {
        "file_path": file_path,
        "file_name": file_path.split("/")[-1],
    }
    # GCS returns datetime objects here, so format them directly
    for key in ["created", "mtime", "atime"]:
        value = stat_result.get(key)
        if isinstance(value, datetime):
            metadata[f"{key}_date"] = value.strftime("%Y-%m-%dT%H:%M:%SZ")
            metadata[key] = int(value.timestamp())
        else:
            metadata[key] = value
    return metadata
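One possible variation on this patch, sketched under assumptions: instead of constructing a new GCSFileSystem on every call, the metadata function can be built around whatever `stat` callable is already available, so the service-account credentials from the reproduction steps are reused. `make_file_metadata` and `fake_stat` are hypothetical names. `fake_stat` only stands in for a real filesystem so the snippet runs without GCS access.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def make_file_metadata(
    stat: Callable[[str], Dict[str, Any]]
) -> Callable[[str], Dict[str, Any]]:
    """Build a metadata function around an existing stat callable (hypothetical helper)."""

    def file_metadata(file_path: str) -> Dict[str, Any]:
        stat_result = stat(file_path)
        metadata: Dict[str, Any] = {
            "file_path": file_path,
            "file_name": file_path.split("/")[-1],
        }
        for key in ("created", "mtime", "atime"):
            value = stat_result.get(key)
            if isinstance(value, datetime):
                # Normalize to UTC before formatting with a literal "Z" suffix.
                utc_value = value.astimezone(timezone.utc)
                metadata[f"{key}_date"] = utc_value.strftime("%Y-%m-%dT%H:%M:%SZ")
                metadata[key] = int(value.timestamp())
            else:
                metadata[key] = value
        return metadata

    return file_metadata


def fake_stat(path: str) -> Dict[str, Any]:
    # Stand-in for fs.stat: one datetime value, one float value, no "atime".
    return {
        "created": datetime(2025, 2, 1, 12, 0, tzinfo=timezone.utc),
        "mtime": 1738411200.0,
    }


metadata_fn = make_file_metadata(fake_stat)
example = metadata_fn("bucket/docs/report.pdf")
print(example)
```

With a real filesystem, the result of `make_file_metadata(fs.stat)` could presumably be passed to SimpleDirectoryReader through the `file_metadata` argument that appears in the traceback, assuming `fs.stat` returns a dict as it does in the patch above.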
The same error occurs with GCSReader when loading data / resource.
ERROR:llama_index.readers.gcs.base:Error loading resource from GCS: 'datetime.datetime' object cannot be interpreted as an integer
TypeError Traceback (most recent call last)
5 frames
/usr/local/lib/python3.10/dist-packages/llama_index/core/readers/file/base.py in _format_file_timestamp(timestamp, include_time)
122 return None
123
--> 124 timestamp_dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
125 if include_time:
126 return timestamp_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
TypeError: 'datetime.datetime' object cannot be interpreted as an integer
Looks like replicating it is as simple as:
GCSReader(bucket="your--bucket", prefix="your-prefix").load_data()
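For anyone without GCS access, the underlying failure can be reproduced with the standard library alone. This is a minimal sketch with made-up values, not library code: it passes a `datetime` to `datetime.fromtimestamp`, which is the call shown at line 124 of both tracebacks. The two error messages in this thread come from the Python 3.9 and Python 3.10 environments visible in the traceback paths, so the exact wording depends on the interpreter version.

```python
from datetime import datetime, timezone

# A datetime value like the one the GCS stat result appears to contain.
value = datetime(2025, 1, 1, tzinfo=timezone.utc)

try:
    # fromtimestamp expects a number, so a datetime argument is rejected.
    datetime.fromtimestamp(value, tz=timezone.utc)
except TypeError as exc:
    error_message = str(exc)
    print(f"TypeError: {error_message}")
else:
    error_message = None

# Converting to a POSIX timestamp first avoids the error.
converted = datetime.fromtimestamp(value.timestamp(), tz=timezone.utc)
print(converted.strftime("%Y-%m-%dT%H:%M:%SZ"))
```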
This is a really big fail - any news on patching this up?
All of you have the ability to contribute a PR if it's urgent, fyi :)
I don't have access to GCS to test a fix, but the fix itself seems pretty simple. I'll take a stab at it
Yeah sorry @logan-markewich I totally would if I could get a free moment :( Plus I admit the patch is working fine, but would help others if it was in.