byaldi Add metadata filtering support and fix multi-document and metadata issue

This PR includes the following changes

If we pass multiple metadata for each document in a folder, the add_to_index will error out because in the below code, it tries to index the ith element, but it is a dictionary, hence it will raise KeyError. Fixed it by renaming metadata as doc_metadata and removed the indexing

for i, item in enumerate(input_items):
     current_doc_id = doc_ids[i] if doc_ids else self.highest_doc_id + 1 + i
     current_metadata = metadata[i] if metadata else None

Added support for filtering based on metadata, so that users can pass a filter_metadata field in the search and get ColPali will only search from these documents.

results = model.search(query, k=5,filter_metadata={"file_name":"attention.pdf"})


print(f"Search results for '{query}':")
for result in results:
   print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

Results

Search results for 'what's the BLEU score of this new strange method?':
Doc ID: 4, Page: 9, Score: 19.75
Doc ID: 4, Page: 8, Score: 19.5
Doc ID: 4, Page: 11, Score: 17.75
Doc ID: 4, Page: 1, Score: 17.625
Doc ID: 4, Page: 14, Score: 17.375

Now if we check doc ID to metadata

model.model.doc_id_to_metadata

Results

{0: {'file_name': 'attention_table.png'},
 1: {'file_name': 'product_c.png'},
 2: {'file_name': 'financial_report.pdf'},
 3: {'file_name': 'attention_with_a_mustache.pdf'},
 4: {'file_name': 'attention.pdf'}}

It only pulled from Doc ID 4, which we intended.

Please let me know about your suggestions. Looking forward to your collaboration

Sep 17 '24 08:09 Athe-kunal

@bclavie Let me know about the PR, thanks

Sep 17 '24 08:09 Athe-kunal

Hey, thanks for this, this PR is helpful!

I have just a couple change requests

The policy is to never deprecate/modify an API, unless it is absolutely critical. As such, could you revert the changes so doc_metadata is back to metadata? The slight added clarity isn't worth making a breaking change.
Could you update the code to be compatible with the rest of the updated codebase (using the newer colpali engine)? Apologies for this, things move fast and the backend API changed slightly for which we acommodated!

I'll play with this further in the next few days just to make sure nothing's broken. Thank you again!

Sep 23 '24 09:09 bclavie

@bclavie @Athe-kunal I was trying to use byaldi for my project and realized that this PR still has an issue with the metadata when there are multiple documents.

In line 340, doc_metadata = metadata[doc_id] if metadata else None -> We already get the metadata associated with the given doc_id which is a dictionary.

In line 406, we once again do current_metadata = metadata[i] if metadata else None -> This will give a key error.

This happens specifically in v0.0.4

Sep 28 '24 01:09 nuschandra

Hi @nuschandra and @bclavie Sorry had a busy week. I got back to this issue and will implement the changes

Sep 28 '24 06:09 Athe-kunal

@bclavie I have addressed the PR comments, and also updated the tutorial quick_overview.ipynb for metadata based filtering. Looking forward to your suggestions.

Sep 28 '24 09:09 Athe-kunal

@nuschandra The bug has been fixed in my current fork. Hoping to merge it soon. Thanks

Sep 28 '24 09:09 Athe-kunal

thanks @Athe-kunal. My apologies for the late review. LGTM at this stage, I'll run some tests and merge afterwards!

Nov 07 '24 08:11 bclavie

LGTM, works locally. Merging and adding into the next release, thank you so much @Athe-kunal

Nov 13 '24 07:11 bclavie

That's great! Thanks @bclavie for building this library, it is a great time-saver for my project

Nov 13 '24 07:11 Athe-kunal

Hi there, I'm using byaldi (good job!) installed with poetry.

I get an error making this search: results = RAG.search(query, k=10, filter_metadata={"type": "Summaries"}). The error is: AttributeError: 'list' object has no attribute 'items', and it is raised by this code:

for metadata_key,metadata_value in metadata_dict.items():
    ...

which is in byaldi/colpali.py at line 598

I'm creating the index with this function:

def make_index(input_dir, index_name):
    first_subdir = True

    for root, _, files in os.walk(input_dir):
        pdf_files = [f for f in files if f.endswith('.pdf')]
        if pdf_files:
            subdir_name = os.path.basename(root)
            metadata = [{"company": os.path.basename(input_dir), "type": subdir_name} for _ in range(len(pdf_files))]
            
            try:
                if first_subdir:
                    RAG.index(
                        input_path=root,
                        index_name=index_name,
                        metadata=metadata,
                        overwrite=True
                    )
                    first_subdir = False
                else:
                    RAG.add_to_index(
                        input_item=root,
                        store_collection_with_index=False,
                        metadata=metadata,
                    )
            except Exception as e:
                print(f"Error processing directory {root}: {e}")

In the README, it seems to me that metadata are passed as list, so I don't understand where I'm wrong.

Thanks!

Dec 13 '24 18:12 fvisconti