dspy icon indicating copy to clipboard operation
dspy copied to clipboard

Colbert local mode support both as retriever and reranker.

Open Athe-kunal opened this issue 4 months ago • 20 comments

Here are the changes proposed here

  1. Colbert as a local retriever
  2. Colbert as a local reranker
  3. Not treating reranking as first-class citizen, rather added RetrieveThenRerank apart from Retrieve for retrieving and reranking
  4. Attached a jupyter notebook for showing the implementation details

Athe-kunal avatar Apr 10 '24 01:04 Athe-kunal

Fixes and Features in the PR

The notebook to follow with all the changes is here

  1. Here, the retriever for retrieveEnsemble and retrieve, combines the passages from all the queries into one list. It is erroneous, as the user who is passing multiple queries, expects that he/she will get the relevant passages for each query in the queries list. In the PR, I have fixed it to return a prediction object or prediction objects in a list where each prediction object contains relevant passages for the given query
  2. Also, the retriever only returns the text without metadata. The metadata will be helpful in downstream tasks of source citations. The metadata return support was added, but the rest of the pipeline won't change. Example of current output
[Prediction(
     pid=[6, 48, 74, 47, 33],
     rerank_score=[15.8359375, 14.2109375, 12.5703125, 11.7890625, 9.1796875],
     passages=['The best things in life are free.', 'Patience is a virtue.', 'To be or not to be, that is the question.', 'Out of sight, out of mind.', 'No pain, no gain.']
 ),
 Prediction(
     pid=[33, 0, 47, 74, 16],
     rerank_score=[19.828125, 12.2890625, 11.171875, 9.09375, 6.8984375],
     passages=['No pain, no gain.', "It's a piece of cake.", 'Out of sight, out of mind.', 'To be or not to be, that is the question.', 'Keep your friends close and your enemies closer.']
 )]
  1. Currently, the reranker will not work in retrieveEnsemble as it is expecting the query and the passages. But there is no reranker support. In this PR, I have added support for Colbert as a reranker. An example of it below
colbert_config = ColBERTConfig()
colbert_config.index_name = 'colbert-ir-index'
colbert_reranker = dspy.ColBERTv2RerankerLocal(
    checkpoint='colbert-ir/colbertv2.0',colbert_config=colbert_config)

dspy.settings.configure(rm=colbert_retriever,reranker=colbert_reranker)

retrieve_rerank = dspy.RetrieveThenRerank(k=5)

pred = retrieve_rerank(
    ["What is the meaning of life?","Meaning of pain?"]
)

The RetrieveThenRerank will first retrieve, and then from those passages will rerank using the Max-sim operator in ColBERT. Other reranker integrations will be next. The idea about RetrieveThenRerank was mentioned as a TO-DO here.

@okhat, @arnavsinghvi11 @CShorten. Please review the PR and suggest feedbacks on improving it.

Athe-kunal avatar Apr 10 '24 06:04 Athe-kunal

very nice work 👏🏻👏🏻

Thanks @Josephrp

Athe-kunal avatar Apr 10 '24 20:04 Athe-kunal

Thanks a lot for these additions @Athe-kunal ! Really appreciate the tutorial notebook! I've left some comments for the PR to mainly address some code cleanup changes.

It would also be great if you could add some documentation for the local ColBERT models to the Retrieval documentation. You can follow the existing ColBERTv2 docs for reference and it would be great to add more specifics on the user parameters for interacting with the RM and potentially some of the relevant implementation details. Thanks!

arnavsinghvi11 avatar Apr 13 '24 01:04 arnavsinghvi11

Thanks a ton, @arnavsinghvi11 for this detailed feedback on my PR. It certainly helps me to be a better contributor. I have made the changes and also addressed some confusion, please o help me out there. I will work on the documentation changes for Colbert. Looking forward to more collaboration

Athe-kunal avatar Apr 13 '24 20:04 Athe-kunal

@arnavsinghvi11 I have added the documentation for COlbert. Please do review this and suggest edits

Athe-kunal avatar Apr 17 '24 02:04 Athe-kunal

@arnavsinghvi11 Can you please review it?

Athe-kunal avatar Apr 21 '24 04:04 Athe-kunal

Hi @arnavsinghvi11 It has been a while, can you review the changes here?

Athe-kunal avatar Apr 28 '24 03:04 Athe-kunal

Hi @Athe-kunal , just left some follow up comments. it seems that there are still some leftover comments to address. feel free to discuss them more directly here as needed.

arnavsinghvi11 avatar Apr 28 '24 17:04 arnavsinghvi11

Hi @arnavsinghvi11 I have made the required changes and resolved some issues. Please let me know if this is good to merge.

Athe-kunal avatar Apr 29 '24 06:04 Athe-kunal

Great PR! Hope it gets merged soon, I find myself needing this.

ahmed-moubtahij avatar May 04 '24 16:05 ahmed-moubtahij

@arnavsinghvi11 Can you review this again?

Athe-kunal avatar May 04 '24 19:05 Athe-kunal

Hi @Athe-kunal , thanks again. I'd love to merge this PR but unfortunately, it breaks some existing caches we have in the repository, particularly the intro.ipynb. For more clarity, the changes to search.py and retrieve.py break the cached retrieved outputs in the intro notebook (which we keep fully cached so users can interact with the DSPy tutorial without having to expend their API key credits).

If you could make changes to this PR that enable the existing search/retrieve functions as I mentioned earlier, while introducing the new behavior separately, I can merge it after that. lmk if this makes sense!

arnavsinghvi11 avatar May 04 '24 23:05 arnavsinghvi11