dspy icon indicating copy to clipboard operation
dspy copied to clipboard

Colbert local mode support both as retriever and reranker.

Open Athe-kunal opened this issue 10 months ago • 20 comments

Here are the changes proposed here

  1. Colbert as a local retriever
  2. Colbert as a local reranker
  3. Not treating reranking as first-class citizen, rather added RetrieveThenRerank apart from Retrieve for retrieving and reranking
  4. Attached a jupyter notebook for showing the implementation details

Athe-kunal avatar Apr 10 '24 01:04 Athe-kunal

Fixes and Features in the PR

The notebook to follow with all the changes is here

  1. Here, the retriever for retrieveEnsemble and retrieve, combines the passages from all the queries into one list. It is erroneous, as the user who is passing multiple queries, expects that he/she will get the relevant passages for each query in the queries list. In the PR, I have fixed it to return a prediction object or prediction objects in a list where each prediction object contains relevant passages for the given query
  2. Also, the retriever only returns the text without metadata. The metadata will be helpful in downstream tasks of source citations. The metadata return support was added, but the rest of the pipeline won't change. Example of current output
[Prediction(
     pid=[6, 48, 74, 47, 33],
     rerank_score=[15.8359375, 14.2109375, 12.5703125, 11.7890625, 9.1796875],
     passages=['The best things in life are free.', 'Patience is a virtue.', 'To be or not to be, that is the question.', 'Out of sight, out of mind.', 'No pain, no gain.']
 ),
 Prediction(
     pid=[33, 0, 47, 74, 16],
     rerank_score=[19.828125, 12.2890625, 11.171875, 9.09375, 6.8984375],
     passages=['No pain, no gain.', "It's a piece of cake.", 'Out of sight, out of mind.', 'To be or not to be, that is the question.', 'Keep your friends close and your enemies closer.']
 )]
  1. Currently, the reranker will not work in retrieveEnsemble as it is expecting the query and the passages. But there is no reranker support. In this PR, I have added support for Colbert as a reranker. An example of it below
colbert_config = ColBERTConfig()
colbert_config.index_name = 'colbert-ir-index'
colbert_reranker = dspy.ColBERTv2RerankerLocal(
    checkpoint='colbert-ir/colbertv2.0',colbert_config=colbert_config)

dspy.settings.configure(rm=colbert_retriever,reranker=colbert_reranker)

retrieve_rerank = dspy.RetrieveThenRerank(k=5)

pred = retrieve_rerank(
    ["What is the meaning of life?","Meaning of pain?"]
)

The RetrieveThenRerank will first retrieve, and then from those passages will rerank using the Max-sim operator in ColBERT. Other reranker integrations will be next. The idea about RetrieveThenRerank was mentioned as a TO-DO here.

@okhat, @arnavsinghvi11 @CShorten. Please review the PR and suggest feedbacks on improving it.

Athe-kunal avatar Apr 10 '24 06:04 Athe-kunal

very nice work 👏🏻👏🏻

Thanks @Josephrp

Athe-kunal avatar Apr 10 '24 20:04 Athe-kunal

Thanks a lot for these additions @Athe-kunal ! Really appreciate the tutorial notebook! I've left some comments for the PR to mainly address some code cleanup changes.

It would also be great if you could add some documentation for the local ColBERT models to the Retrieval documentation. You can follow the existing ColBERTv2 docs for reference and it would be great to add more specifics on the user parameters for interacting with the RM and potentially some of the relevant implementation details. Thanks!

arnavsinghvi11 avatar Apr 13 '24 01:04 arnavsinghvi11

Thanks a ton, @arnavsinghvi11 for this detailed feedback on my PR. It certainly helps me to be a better contributor. I have made the changes and also addressed some confusion, please o help me out there. I will work on the documentation changes for Colbert. Looking forward to more collaboration

Athe-kunal avatar Apr 13 '24 20:04 Athe-kunal

@arnavsinghvi11 I have added the documentation for COlbert. Please do review this and suggest edits

Athe-kunal avatar Apr 17 '24 02:04 Athe-kunal

@arnavsinghvi11 Can you please review it?

Athe-kunal avatar Apr 21 '24 04:04 Athe-kunal

Hi @arnavsinghvi11 It has been a while, can you review the changes here?

Athe-kunal avatar Apr 28 '24 03:04 Athe-kunal

Hi @Athe-kunal , just left some follow up comments. it seems that there are still some leftover comments to address. feel free to discuss them more directly here as needed.

arnavsinghvi11 avatar Apr 28 '24 17:04 arnavsinghvi11

Hi @arnavsinghvi11 I have made the required changes and resolved some issues. Please let me know if this is good to merge.

Athe-kunal avatar Apr 29 '24 06:04 Athe-kunal

Great PR! Hope it gets merged soon, I find myself needing this.

ahmed-moubtahij avatar May 04 '24 16:05 ahmed-moubtahij

@arnavsinghvi11 Can you review this again?

Athe-kunal avatar May 04 '24 19:05 Athe-kunal

Hi @Athe-kunal , thanks again. I'd love to merge this PR but unfortunately, it breaks some existing caches we have in the repository, particularly the intro.ipynb. For more clarity, the changes to search.py and retrieve.py break the cached retrieved outputs in the intro notebook (which we keep fully cached so users can interact with the DSPy tutorial without having to expend their API key credits).

If you could make changes to this PR that enable the existing search/retrieve functions as I mentioned earlier, while introducing the new behavior separately, I can merge it after that. lmk if this makes sense!

arnavsinghvi11 avatar May 04 '24 23:05 arnavsinghvi11

Thanks for the suggestion @arnavsinghvi11 I had a query regarding the joblib memory cache. In the current implementation, we have a list of text (only the relevant context) which is being cached. But in my implementation, I am returning a prediction object. Can joblib cache any abstracted data type or just text? Also, for the local Colbertv2. Should I integrate the caching mechanism? It is a local model, and it does not require us to send the request to an API. Can you help me on the PR @arnavsinghvi11?

Athe-kunal avatar May 08 '24 20:05 Athe-kunal

Hi @arnavsinghvi11 Can you help me with the caching functionality here for the PR? I am unable to understand the caching mechanism here.

Athe-kunal avatar May 10 '24 18:05 Athe-kunal

Hi @Athe-kunal ,

I would firstly recommend keeping the search and retrieve abstractions with as minimal differences as possible. I think I mentioned this in an earlier review but the changes to retrieve, retrieveRerankEnsemble, retrieveEnsemble, and the forward function in Retrieve directly impact existing caches with the intro.ipynb notebook. Any changes to those make it difficult to merge PRs without having to rerun all the requests with the new changes (which we highly prefer not to do to maintain consistency).

In more simple terms, the ColBERTv2 pipeline with search and retrieve should work as-is, and you can test this with your PR changes by running the intro notebook and checking if any changes added result in the notebook failing to execute. Let me know if this is clear as it's not a caching problem to implement in this PR, but rather to maintain existing behavior that will ensure existing caches work!

Some steps I suggest with this is reverting back to the original behavior of search and retrieve and then integrating ColBERTv2RetrieverLocal and ColBERTv2RerankerLocal to work with those untouched functions. If you follow the existing setup of ColBERTv2 with its corresponding return types, I think this can work. Once that's done, feel free to ping me back on the PR so we can take a look and ensure no existing caches break.

From there, the next step would be to mirror the existing colbertv2 cache functions to align with the local mode being introduced.

Let me know if this all makes sense!

arnavsinghvi11 avatar May 11 '24 18:05 arnavsinghvi11

Hi @arnavsinghvi11 Thanks for your detailed description. I had one small query; the Colbert retriever is a local model. The caching is essential if we are sending to a third party API. But if it is a local model, then does it require caching? If users expose the colbert model to a server, then they can directly use previous ColbertRetriever which has caching. Please let me know if I am missing something.

Athe-kunal avatar May 14 '24 02:05 Athe-kunal

But if it is a local model, then does it require caching?

Not a direct answer, but HFClientTGI supports caching for models hosted locally (REST API). Maybe it would benefit here as well. But caching is not a requirement with this PR. Ensuring backward compatibility with the existing dspy.ColBERTv2 in addition to supporting local ColBERT is!

arnavsinghvi11 avatar May 15 '24 21:05 arnavsinghvi11

Got it @arnavsinghvi11, working on it I thought that users can first create a colbertv2 local index or reranker first, and then expose it to a server and then they can use the existing Colbertv2 functionality (which supports caching). However, exposing local models to caching can also be helpful. I provided a notebook for these Colbertv2 retrievers and reranker, and the search and retrieve functions were working properly.

Also, if you don't mind, can you share your discord handle on the DSPy discord channel, it would be helpful to interact there too.

Athe-kunal avatar May 16 '24 22:05 Athe-kunal

Hi @arnavsinghvi11 I tested with the intro.ipynb file, and it worked without hiccups. Can you respond to the above requests?

Athe-kunal avatar May 21 '24 18:05 Athe-kunal

Hi @Athe-kunal , I don't see any changes to the PR. could you check if they are pushed? my username is bigman11 btw

arnavsinghvi11 avatar May 26 '24 00:05 arnavsinghvi11

Hi @arnavsinghvi11 I have made the following changes

  1. Added separate functions for retrieving with metadata so that existing cache won't break. I am passing with_metadata parameter, which has a default value of False, thus the current tutorials would work fine with it
  2. Also, I added a by_prob parameter with default value of True to dspy.Retrieve. It turns out that dsp.retrieve needs this by_prob parameter, but it is being passed as a kwargs parameter.

This PR has become way too much intractable, I will work on ColbertLocal caching in another PR. Can you review this one @arnavsinghvi11 ?

Athe-kunal avatar Jun 08 '24 03:06 Athe-kunal

Thanks @Athe-kunal for your patience with this PR! The changes no longer break the existing intro.ipynb caches and is good to merge. Did you want to add all these changes to the separate PR, or should I go ahead and squash+merge the changes?

arnavsinghvi11 avatar Jun 13 '24 14:06 arnavsinghvi11

Hi @arnavsinghvi11 Thanks for your support and guidance throughout the PR process, it was a great learning experience. You can squash and merge the changes for this PR. I will work on the caching for the new Colbert Local models this weekend, but for now you can merge this one

Thanks again for helping me in this PR.

Athe-kunal avatar Jun 13 '24 19:06 Athe-kunal

Merged. Thanks again @Athe-kunal !

arnavsinghvi11 avatar Jun 15 '24 18:06 arnavsinghvi11