byaldi icon indicating copy to clipboard operation
byaldi copied to clipboard

Add metadata filtering support and fix multi-document and metadata issue

Open Athe-kunal opened this issue 1 year ago • 6 comments

This PR includes the following changes

  1. If we pass multiple metadata for each document in a folder, the add_to_index will error out because in the below code, it tries to index the ith element, but it is a dictionary, hence it will raise KeyError. Fixed it by renaming metadata as doc_metadata and removed the indexing
for i, item in enumerate(input_items):
     current_doc_id = doc_ids[i] if doc_ids else self.highest_doc_id + 1 + i
     current_metadata = metadata[i] if metadata else None
  1. Added support for filtering based on metadata, so that users can pass a filter_metadata field in the search and get ColPali will only search from these documents.
results = model.search(query, k=5,filter_metadata={"file_name":"attention.pdf"})


print(f"Search results for '{query}':")
for result in results:
   print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

Results

Search results for 'what's the BLEU score of this new strange method?':
Doc ID: 4, Page: 9, Score: 19.75
Doc ID: 4, Page: 8, Score: 19.5
Doc ID: 4, Page: 11, Score: 17.75
Doc ID: 4, Page: 1, Score: 17.625
Doc ID: 4, Page: 14, Score: 17.375

Now if we check doc ID to metadata

model.model.doc_id_to_metadata

Results

{0: {'file_name': 'attention_table.png'},
 1: {'file_name': 'product_c.png'},
 2: {'file_name': 'financial_report.pdf'},
 3: {'file_name': 'attention_with_a_mustache.pdf'},
 4: {'file_name': 'attention.pdf'}}

It only pulled from Doc ID 4, which we intended.

Please let me know about your suggestions. Looking forward to your collaboration

Athe-kunal avatar Sep 17 '24 08:09 Athe-kunal

@bclavie Let me know about the PR, thanks

Athe-kunal avatar Sep 17 '24 08:09 Athe-kunal

Hey, thanks for this, this PR is helpful!

I have just a couple change requests

  • The policy is to never deprecate/modify an API, unless it is absolutely critical. As such, could you revert the changes so doc_metadata is back to metadata? The slight added clarity isn't worth making a breaking change.
  • Could you update the code to be compatible with the rest of the updated codebase (using the newer colpali engine)? Apologies for this, things move fast and the backend API changed slightly for which we acommodated!

I'll play with this further in the next few days just to make sure nothing's broken. Thank you again!

bclavie avatar Sep 23 '24 09:09 bclavie

@bclavie @Athe-kunal I was trying to use byaldi for my project and realized that this PR still has an issue with the metadata when there are multiple documents.

In line 340, doc_metadata = metadata[doc_id] if metadata else None -> We already get the metadata associated with the given doc_id which is a dictionary.

In line 406, we once again do current_metadata = metadata[i] if metadata else None -> This will give a key error.

This happens specifically in v0.0.4

nuschandra avatar Sep 28 '24 01:09 nuschandra

Hi @nuschandra and @bclavie Sorry had a busy week. I got back to this issue and will implement the changes

Athe-kunal avatar Sep 28 '24 06:09 Athe-kunal

@bclavie I have addressed the PR comments, and also updated the tutorial quick_overview.ipynb for metadata based filtering. Looking forward to your suggestions.

Athe-kunal avatar Sep 28 '24 09:09 Athe-kunal

@nuschandra The bug has been fixed in my current fork. Hoping to merge it soon. Thanks

Athe-kunal avatar Sep 28 '24 09:09 Athe-kunal

thanks @Athe-kunal. My apologies for the late review. LGTM at this stage, I'll run some tests and merge afterwards!

bclavie avatar Nov 07 '24 08:11 bclavie

LGTM, works locally. Merging and adding into the next release, thank you so much @Athe-kunal

bclavie avatar Nov 13 '24 07:11 bclavie

That's great! Thanks @bclavie for building this library, it is a great time-saver for my project

Athe-kunal avatar Nov 13 '24 07:11 Athe-kunal

Hi there, I'm using byaldi (good job!) installed with poetry.

I get an error making this search: results = RAG.search(query, k=10, filter_metadata={"type": "Summaries"}). The error is: AttributeError: 'list' object has no attribute 'items', and it is raised by this code:

for metadata_key,metadata_value in metadata_dict.items():
    ...

which is in byaldi/colpali.py at line 598

I'm creating the index with this function:

def make_index(input_dir, index_name):
    first_subdir = True

    for root, _, files in os.walk(input_dir):
        pdf_files = [f for f in files if f.endswith('.pdf')]
        if pdf_files:
            subdir_name = os.path.basename(root)
            metadata = [{"company": os.path.basename(input_dir), "type": subdir_name} for _ in range(len(pdf_files))]
            
            try:
                if first_subdir:
                    RAG.index(
                        input_path=root,
                        index_name=index_name,
                        metadata=metadata,
                        overwrite=True
                    )
                    first_subdir = False
                else:
                    RAG.add_to_index(
                        input_item=root,
                        store_collection_with_index=False,
                        metadata=metadata,
                    )
            except Exception as e:
                print(f"Error processing directory {root}: {e}")

In the README, it seems to me that metadata are passed as list, so I don't understand where I'm wrong.

Thanks!

fvisconti avatar Dec 13 '24 18:12 fvisconti