llama_index

SimpleDirectoryReader returns only a subset of documents if file_metadata is specified

Open qtangs opened this issue 1 year ago • 0 comments

When file_metadata is passed to SimpleDirectoryReader, the number of documents returned is capped at the number of files loaded, even though each file may be split into several documents.

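    import logging

    from llama_index import SimpleDirectoryReader

    logging.getLogger().setLevel(logging.DEBUG)
    logger = logging.getLogger()

    # directory_path and get_file_metadata (a callable that maps a file
    # path to a metadata dict) are defined elsewhere.
    documents = SimpleDirectoryReader(
        directory_path,
        recursive=True,
        file_metadata=get_file_metadata,
    ).load_data()
    logger.info(f"Number of documents loaded: {len(documents)}")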

Expected:

DEBUG:root:> [SimpleDirectoryReader] Total files added: 74
INFO:root:Number of documents loaded: 451

Actual:

DEBUG:root:> [SimpleDirectoryReader] Total files added: 74
INFO:root:Number of documents loaded: 74

Cause:

The error comes from the code linked below: metadata_list has one entry per file, while data_list has one entry per document and is therefore longer, so the result is cut down to the length of the shorter list.

https://github.com/jerryjliu/gpt_index/blob/e34ab9eafbf24419458b50f89d678f9d575de92f/gpt_index/readers/file/base.py#L164-L165
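To illustrate the mismatch without depending on the library, here is a minimal sketch. It assumes the linked lines pair data_list and metadata_list element by element (for example with zip, which stops at the shorter list); I have not reproduced the library code here. The names per_file_texts, load_truncating and load_per_document are hypothetical and exist only for this example. The second function shows one possible fix: append the file's metadata once per document rather than once per file.

```python
from typing import Callable, Dict, List, Tuple

Metadata = Dict[str, str]


def load_truncating(
    per_file_texts: Dict[str, List[str]],
    file_metadata: Callable[[str], Metadata],
) -> List[Tuple[str, Metadata]]:
    """Mimics the reported behaviour (assumed: lists are paired with zip)."""
    data_list: List[str] = []
    metadata_list: List[Metadata] = []
    for path, texts in per_file_texts.items():
        data_list.extend(texts)                    # one entry per document
        metadata_list.append(file_metadata(path))  # one entry per file
    # zip stops at the shorter list, so the extra documents are dropped.
    return list(zip(data_list, metadata_list))


def load_per_document(
    per_file_texts: Dict[str, List[str]],
    file_metadata: Callable[[str], Metadata],
) -> List[Tuple[str, Metadata]]:
    """Possible fix: repeat the file's metadata for each of its documents."""
    data_list: List[str] = []
    metadata_list: List[Metadata] = []
    for path, texts in per_file_texts.items():
        metadata = file_metadata(path)
        for text in texts:
            data_list.append(text)
            metadata_list.append(metadata)
    return list(zip(data_list, metadata_list))


def get_file_metadata(path: str) -> Metadata:
    # Hypothetical metadata function, standing in for the one in the report.
    return {"file_name": path}


if __name__ == "__main__":
    # Two files that split into three and two documents respectively.
    sample = {"a.txt": ["a1", "a2", "a3"], "b.txt": ["b1", "b2"]}
    print(len(load_truncating(sample, get_file_metadata)))
    print(len(load_per_document(sample, get_file_metadata)))
```

With two files that split into five documents in total, the first function returns two pairs (and attaches b.txt's metadata to a document that came from a.txt), while the second returns all five with the metadata of the file each one came from.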

qtangs, Mar 12 '23 05:03