Add metadata filtering support and fix multi-document and metadata issue
This PR includes the following changes:
- If we pass metadata for each of multiple documents in a folder, add_to_index errors out. In the code below, it tries to index the i-th element of metadata, but at that point metadata is already a dictionary, so it raises a KeyError. Fixed by renaming metadata to doc_metadata and removing the indexing.
for i, item in enumerate(input_items):
    current_doc_id = doc_ids[i] if doc_ids else self.highest_doc_id + 1 + i
    current_metadata = metadata[i] if metadata else None
- Added support for filtering based on metadata: users can pass a filter_metadata argument to search, and ColPali will only search the documents whose metadata matches.
results = model.search(query, k=5, filter_metadata={"file_name": "attention.pdf"})
print(f"Search results for '{query}':")
for result in results:
    print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")
Results
Search results for 'what's the BLEU score of this new strange method?':
Doc ID: 4, Page: 9, Score: 19.75
Doc ID: 4, Page: 8, Score: 19.5
Doc ID: 4, Page: 11, Score: 17.75
Doc ID: 4, Page: 1, Score: 17.625
Doc ID: 4, Page: 14, Score: 17.375
Now, if we check the doc ID to metadata mapping:
model.model.doc_id_to_metadata
Results
{0: {'file_name': 'attention_table.png'},
1: {'file_name': 'product_c.png'},
2: {'file_name': 'financial_report.pdf'},
3: {'file_name': 'attention_with_a_mustache.pdf'},
4: {'file_name': 'attention.pdf'}}
It only pulled from Doc ID 4, which we intended.
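To illustrate the filtering idea, here is a minimal sketch in plain Python. It is not the code from this PR: matching_doc_ids is a hypothetical helper, and it assumes a filter matches a document only when every key/value pair in the filter is present, with an equal value, in that document's metadata.

```python
from typing import Any, Dict, List


def matching_doc_ids(
    doc_id_to_metadata: Dict[int, Dict[str, Any]],
    filter_metadata: Dict[str, Any],
) -> List[int]:
    """Return the doc IDs whose metadata contains every pair in the filter.

    Hypothetical helper for illustration only; exact-equality matching
    is an assumption, not necessarily what the library does.
    """
    return [
        doc_id
        for doc_id, metadata in doc_id_to_metadata.items()
        if all(
            key in metadata and metadata[key] == value
            for key, value in filter_metadata.items()
        )
    ]


# Same mapping as shown in the results above.
doc_id_to_metadata = {
    0: {"file_name": "attention_table.png"},
    1: {"file_name": "product_c.png"},
    2: {"file_name": "financial_report.pdf"},
    3: {"file_name": "attention_with_a_mustache.pdf"},
    4: {"file_name": "attention.pdf"},
}

print(matching_doc_ids(doc_id_to_metadata, {"file_name": "attention.pdf"}))  # [4]
```

Because the comparison is exact, attention_with_a_mustache.pdf is not selected by the attention.pdf filter.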
Please let me know about your suggestions. Looking forward to your collaboration
@bclavie Let me know about the PR, thanks
Hey, thanks for this, this PR is helpful!
I have just a couple of change requests:
- The policy is to never deprecate/modify an API unless it is absolutely critical. As such, could you revert the changes so doc_metadata is back to metadata? The slight added clarity isn't worth making a breaking change.
- Could you update the code to be compatible with the rest of the updated codebase (using the newer colpali engine)? Apologies for this, things move fast and the backend API changed slightly, which we accommodated!
I'll play with this further in the next few days just to make sure nothing's broken. Thank you again!
@bclavie @Athe-kunal I was trying to use byaldi for my project and realized that this PR still has an issue with the metadata when there are multiple documents.
On line 340, doc_metadata = metadata[doc_id] if metadata else None already retrieves the metadata associated with the given doc_id, which is a dictionary.
On line 406, we once again do current_metadata = metadata[i] if metadata else None, which indexes that dictionary by position and raises a KeyError.
This happens specifically in v0.0.4
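The double lookup described above can be reproduced outside the library. The sketch below is not byaldi's code: per_page_metadata_buggy and per_page_metadata_fixed are hypothetical stand-ins for the second lookup, used only to show why indexing an already-selected dictionary by position fails.

```python
def per_page_metadata_buggy(metadata, i):
    # `metadata` is already ONE document's dict here, so metadata[i]
    # looks up the key `i` instead of the i-th document.
    return metadata[i] if metadata else None


def per_page_metadata_fixed(doc_metadata):
    # The dict was already selected for this document; use it directly.
    return doc_metadata if doc_metadata else None


all_metadata = [{"file_name": "a.pdf"}, {"file_name": "b.pdf"}]

# First lookup: pick the dict that belongs to the second document.
doc_metadata = all_metadata[1]

raised_key_error = False
try:
    per_page_metadata_buggy(doc_metadata, 0)  # second lookup on a dict
except KeyError:
    raised_key_error = True

print(raised_key_error)                       # True
print(per_page_metadata_fixed(doc_metadata))  # {'file_name': 'b.pdf'}
```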
Hi @nuschandra and @bclavie, sorry, I had a busy week. I'm back on this issue and will implement the changes.
@bclavie
I have addressed the PR comments, and also updated the tutorial quick_overview.ipynb for metadata-based filtering. Looking forward to your suggestions.
@nuschandra The bug has been fixed in my current fork. Hoping to merge it soon. Thanks
thanks @Athe-kunal. My apologies for the late review. LGTM at this stage, I'll run some tests and merge afterwards!
LGTM, works locally. Merging and adding into the next release, thank you so much @Athe-kunal
That's great! Thanks @bclavie for building this library, it is a great time-saver for my project
Hi there, I'm using byaldi (good job!) installed with poetry.
I get an error making this search: results = RAG.search(query, k=10, filter_metadata={"type": "Summaries"}). The error is AttributeError: 'list' object has no attribute 'items', and it is raised by this code:
for metadata_key, metadata_value in metadata_dict.items():
    ...
which is in byaldi/colpali.py at line 598.
I'm creating the index with this function:
import os

def make_index(input_dir, index_name):
    # RAG is the already-loaded byaldi model, defined elsewhere in the script.
    first_subdir = True
    for root, _, files in os.walk(input_dir):
        pdf_files = [f for f in files if f.endswith('.pdf')]
        if pdf_files:
            subdir_name = os.path.basename(root)
            # One metadata dict per PDF in this subdirectory.
            metadata = [{"company": os.path.basename(input_dir), "type": subdir_name} for _ in range(len(pdf_files))]
            try:
                if first_subdir:
                    # The first subdirectory creates the index.
                    RAG.index(
                        input_path=root,
                        index_name=index_name,
                        metadata=metadata,
                        overwrite=True
                    )
                    first_subdir = False
                else:
                    # Later subdirectories are appended to the existing index.
                    RAG.add_to_index(
                        input_item=root,
                        store_collection_with_index=False,
                        metadata=metadata,
                    )
            except Exception as e:
                print(f"Error processing directory {root}: {e}")
In the README, it seems to me that metadata is passed as a list, so I don't understand what I'm doing wrong.
Thanks!
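The error message above means that metadata_dict was a list, not a dict, when .items() was called on it. As a diagnostic sketch (non_dict_metadata is a hypothetical helper, and the example mapping is made-up data rather than output from a real index), one could look for entries in the doc ID to metadata mapping that are not dictionaries:

```python
def non_dict_metadata(doc_id_to_metadata):
    """Map each doc ID whose stored metadata is not a dict to its type name.

    Hypothetical diagnostic helper; entries set to None are skipped.
    """
    return {
        doc_id: type(metadata).__name__
        for doc_id, metadata in doc_id_to_metadata.items()
        if metadata is not None and not isinstance(metadata, dict)
    }


# Made-up mapping: doc 1 stores a whole list where a single dict is expected.
example_mapping = {
    0: {"type": "Summaries"},
    1: [{"type": "Summaries"}, {"type": "Summaries"}],
}

print(non_dict_metadata(example_mapping))  # {1: 'list'}

# Calling .items() on such an entry reproduces the reported error text.
try:
    example_mapping[1].items()
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'items'
```

On a real index, the mapping to inspect would presumably be the doc_id_to_metadata attribute accessed earlier in this thread as model.model.doc_id_to_metadata. Whether the list ends up there because of how index or add_to_index handles a directory is not established here.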