
Unable to add Metadata to index

Open • NMVRodrigues opened this issue 5 months ago • 1 comment

When I try to add metadata to an index, either as a list of metadata dicts or as a mapping of uid to metadata dict (shown below), indexing always raises a KeyError.
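For reference, the two metadata shapes mentioned above can be sketched in plain Python as follows. The file names are hypothetical placeholders, and this only illustrates the data structures, not how byaldi consumes them.

```python
# Hypothetical file stems standing in for the real PDF names.
report_ids = ["report_a", "report_b", "report_c"]

# Simple integer uids, one per document.
uids = list(range(len(report_ids)))

# Shape 1: a list with one metadata dict per document, in input order.
metadata_list = [{"file_name": name} for name in report_ids]

# Shape 2: a mapping from each document uid to its metadata dict.
metadata_map = {uid: {"file_name": name} for uid, name in zip(uids, report_ids)}

print(metadata_list[0])
print(metadata_map[0])
```

With integer uids starting at 0, both shapes return the same dict for `[0]`, so a positional lookup on either one would be expected to succeed.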

Example:

import os
from glob import glob

from byaldi import RAGMultiModalModel

RAG = RAGMultiModalModel.from_pretrained("vidore/colpali", device='cuda:2', verbose=1)

# contains 29 1-page pdfs
files = glob(os.path.join('/dataset_pdf', '*.pdf'))

# generate simple unique ids
uids = list(range(len(files)))

# get the file names
report_ids = [file.split('/')[-1].split('.pdf')[0] for file in files]

metadata = {uids[i]: {'file_name':report_ids[i]} for i in range(len(uids))}

RAG.index(
    input_path='dataset_pdf',
    index_name='Documents', # index will be saved at index_root/index_name/
    doc_ids=uids,
    store_collection_with_index=True,
    overwrite=True,
    metadata=metadata,
)

This produces the following error:

report_ids = [file.split('/')[-1].split('.pdf')[0] for file in files]
metadata = {uids[i]: {'file_name':report_ids[i]} for i in range(len(uids))}
--> RAG.index(
        input_path='dataset_pdf',
        index_name='Documents', # index will be saved at index_root/index_name/
        doc_ids=uids,
        store_collection_with_index=True,
        overwrite=True,
        metadata=metadata,
    )

File ~/miniconda3/envs/rag/lib/python3.9/site-packages/byaldi/RAGModel.py:111, in RAGMultiModalModel.index(self, input_path, index_name, doc_ids, store_collection_with_index, overwrite, metadata)
def index(
 self,
 input_path: Union[str, Path],
   (...)
...
-->  current_metadata = metadata[i] if metadata else None
     if current_doc_id in self.doc_ids:
         raise ValueError(f"Document ID {current_doc_id} already exists in the index")

KeyError: 0

Removing the metadata argument avoids the error. However, passing it should work according to the metadata docstring of RAGMultiModalModel.
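One observation that may help narrow this down: the mapping passed in has key 0, so `metadata[0]` on that mapping cannot raise `KeyError: 0`. The object indexed at the failing line in the traceback therefore seems to be something other than the full mapping. The sketch below reproduces the same error in plain Python under one assumption that is not confirmed from the library source: that only a single document's dict reaches that line. The file names are hypothetical.

```python
# Mapping built the same way as in the example, with hypothetical names.
report_ids = ["report_a", "report_b"]
uids = list(range(len(report_ids)))
metadata = {uid: {"file_name": name} for uid, name in zip(uids, report_ids)}

# The full mapping has key 0, so this lookup succeeds.
assert metadata[0] == {"file_name": "report_a"}

# Assumption (not verified against byaldi): only one document's dict
# is passed along before the failing line. Its only key is 'file_name',
# so a positional lookup with 0 fails.
single_doc_metadata = metadata[0]
try:
    single_doc_metadata[0]
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError(0)
```

If that assumption holds, the same failure would occur for a list of metadata dicts as well, since each element is again a per-document dict, which would match both input shapes failing.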

NMVRodrigues, Sep 11 '24 23:09