llama_index
SimpleDirectoryReader returns only a subset of documents if file_metadata is specified
When file_metadata is passed to SimpleDirectoryReader, the number of documents returned is capped at the number of files loaded, even if each file could have been split into many documents.
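import logging
from llama_index import SimpleDirectoryReader  # or gpt_index, depending on the installed package

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# directory_path and get_file_metadata are defined elsewhere by the caller
documents = SimpleDirectoryReader(
    directory_path,
    recursive=True,
    file_metadata=get_file_metadata,
).load_data()
logger.info(f"Number of documents loaded: {len(documents)}")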
Expected:
DEBUG:root:> [SimpleDirectoryReader] Total files added: 74
INFO:root:Number of documents loaded: 451
Actual:
DEBUG:root:> [SimpleDirectoryReader] Total files added: 74
INFO:root:Number of documents loaded: 74
Cause:
The error comes from the code linked below. metadata_list has one entry per file, while data_list has one entry per document, so data_list is the longer of the two.
https://github.com/jerryjliu/gpt_index/blob/e34ab9eafbf24419458b50f89d678f9d575de92f/gpt_index/readers/file/base.py#L164-L165
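To illustrate the mismatch, here is a minimal standalone sketch. It is not the library's code: it assumes the two lists are paired with something like zip, which stops at the shorter list, and the helper names (pair_truncating, pair_per_document) and sample data are hypothetical. The second helper shows one possible fix, which is to repeat each file's metadata for every document produced from that file.

```python
from typing import Dict, List, Tuple


def pair_truncating(data_list: List[str], metadata_list: List[Dict]) -> List[Tuple[str, Dict]]:
    # Assumed failure mode: zip stops at the shorter list, so only
    # len(metadata_list) documents survive.
    return list(zip(data_list, metadata_list))


def pair_per_document(
    docs_per_file: List[List[str]], metadata_per_file: List[Dict]
) -> List[Tuple[str, Dict]]:
    # Possible fix (hypothetical): attach the file's metadata to every
    # document that came from that file.
    pairs = []
    for docs, meta in zip(docs_per_file, metadata_per_file):
        for doc in docs:
            pairs.append((doc, meta))
    return pairs


# Made-up sample: two files, split into 3 and 2 documents respectively.
docs_per_file = [["a1", "a2", "a3"], ["b1", "b2"]]
metadata_per_file = [{"file_name": "a.pdf"}, {"file_name": "b.pdf"}]
data_list = [doc for docs in docs_per_file for doc in docs]

print(len(pair_truncating(data_list, metadata_per_file)))        # 2
print(len(pair_per_document(docs_per_file, metadata_per_file)))  # 5
```

With 74 files and 451 documents, the same truncation would explain why only 74 documents come back.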