ColBERT local mode support, both as a retriever and as a reranker.
The changes proposed here:
- ColBERT as a local retriever
- ColBERT as a local reranker
- Reranking is not treated as a first-class citizen; instead, a RetrieveThenRerank module was added alongside Retrieve for retrieving and then reranking
- Attached a Jupyter notebook showing the implementation details
Fixes and Features in the PR
The notebook to follow along with all the changes is here
- Here, the retrievers for retrieveEnsemble and retrieve combine the passages from all the queries into one list. This is erroneous: a user who passes multiple queries expects to get the relevant passages for each query in the queries list. In the PR, I have fixed it to return a Prediction object (or a list of Prediction objects), where each Prediction contains the relevant passages for its query
- Also, the retriever only returns the text, without metadata. The metadata is helpful in downstream tasks such as source citation. Metadata return support was added, while the rest of the pipeline stays unchanged. Example of the current output:
```python
[Prediction(
    pid=[6, 48, 74, 47, 33],
    rerank_score=[15.8359375, 14.2109375, 12.5703125, 11.7890625, 9.1796875],
    passages=['The best things in life are free.', 'Patience is a virtue.', 'To be or not to be, that is the question.', 'Out of sight, out of mind.', 'No pain, no gain.']
),
Prediction(
    pid=[33, 0, 47, 74, 16],
    rerank_score=[19.828125, 12.2890625, 11.171875, 9.09375, 6.8984375],
    passages=['No pain, no gain.', "It's a piece of cake.", 'Out of sight, out of mind.', 'To be or not to be, that is the question.', 'Keep your friends close and your enemies closer.']
)]
```
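The per-query grouping fix described above can be sketched in plain Python. The field names (`pid`, `rerank_score`, `passages`) mirror the example output; the scoring function is a toy stand-in, not ColBERT's:

```python
# Minimal sketch of returning one result per query instead of a single
# flattened passage list. Field names mirror the example output above;
# the word-overlap scorer is illustrative only.

def retrieve_per_query(queries, corpus, k=5, score_fn=None):
    """Return one result dict per query, each with its own top-k passages."""
    score_fn = score_fn or (
        lambda q, p: len(set(q.lower().split()) & set(p.lower().split())))
    results = []
    for query in queries:
        scored = sorted(
            ((score_fn(query, passage), pid, passage)
             for pid, passage in enumerate(corpus)),
            reverse=True,
        )[:k]
        results.append({
            "pid": [pid for _, pid, _ in scored],
            "rerank_score": [s for s, _, _ in scored],
            "passages": [p for _, _, p in scored],
        })
    return results  # one entry per query, not one flattened list

corpus = ["No pain, no gain.", "Out of sight, out of mind.",
          "Patience is a virtue."]
preds = retrieve_per_query(["meaning of pain", "mind and sight"], corpus, k=2)
```

Each entry keeps its `pid`, `rerank_score`, and `passages` lists aligned, so downstream code can cite sources per query.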
- Currently, the reranker will not work in retrieveEnsemble, as it expects the query and the passages but there is no reranker support. In this PR, I have added support for ColBERT as a reranker. An example is below:
```python
colbert_config = ColBERTConfig()
colbert_config.index_name = 'colbert-ir-index'
colbert_reranker = dspy.ColBERTv2RerankerLocal(
    checkpoint='colbert-ir/colbertv2.0', colbert_config=colbert_config)
# colbert_retriever is a dspy.ColBERTv2RetrieverLocal configured earlier
dspy.settings.configure(rm=colbert_retriever, reranker=colbert_reranker)
retrieve_rerank = dspy.RetrieveThenRerank(k=5)
pred = retrieve_rerank(
    ["What is the meaning of life?", "Meaning of pain?"]
)
```
The RetrieveThenRerank module will first retrieve, and then rerank those passages using ColBERT's MaxSim operator. Other reranker integrations will come next. The idea for RetrieveThenRerank was mentioned as a TO-DO here.
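For reference, ColBERT's MaxSim operator scores a document by summing, over each query token embedding, the maximum similarity against any document token embedding. A minimal pure-Python sketch with toy embeddings and dot-product similarity (not the actual ColBERT implementation, which uses learned, normalized per-token embeddings):

```python
# Toy MaxSim: score(Q, D) = sum over query tokens of the max dot product
# against any document token embedding. The vectors are illustrative only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_embs, doc_embs):
    """Late-interaction MaxSim score between two bags of token vectors."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

query = [[1.0, 0.0], [0.0, 1.0]]   # two query token embeddings
doc_a = [[0.9, 0.1], [0.2, 0.8]]   # well-aligned document
doc_b = [[0.1, 0.1], [0.2, 0.1]]   # poorly aligned document

# Reranking keeps the candidates but reorders them by MaxSim score.
scores = {"doc_a": maxsim(query, doc_a), "doc_b": maxsim(query, doc_b)}
```

Because each query token independently picks its best-matching document token, MaxSim rewards documents that cover all parts of the query.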
@okhat, @arnavsinghvi11 @CShorten. Please review the PR and suggest feedback on improving it.
very nice work 👏🏻👏🏻
Thanks @Josephrp
Thanks a lot for these additions @Athe-kunal ! Really appreciate the tutorial notebook! I've left some comments for the PR to mainly address some code cleanup changes.
It would also be great if you could add some documentation for the local ColBERT models to the Retrieval documentation. You can follow the existing ColBERTv2 docs for reference and it would be great to add more specifics on the user parameters for interacting with the RM and potentially some of the relevant implementation details. Thanks!
Thanks a ton, @arnavsinghvi11 for this detailed feedback on my PR. It certainly helps me to be a better contributor. I have made the changes and also flagged some points of confusion, please do help me out there. I will work on the documentation changes for ColBERT. Looking forward to more collaboration
@arnavsinghvi11 I have added the documentation for ColBERT. Please do review this and suggest edits
@arnavsinghvi11 Can you please review it?
Hi @arnavsinghvi11 It has been a while, can you review the changes here?
Hi @Athe-kunal , just left some follow-up comments. It seems that there are still some leftover comments to address. Feel free to discuss them more directly here as needed.
Hi @arnavsinghvi11 I have made the required changes and resolved some issues. Please let me know if this is good to merge.
Great PR! Hope it gets merged soon, I find myself needing this.
@arnavsinghvi11 Can you review this again?
Hi @Athe-kunal , thanks again. I'd love to merge this PR but unfortunately, it breaks some existing caches we have in the repository, particularly the intro.ipynb. For more clarity, the changes to search.py and retrieve.py break the cached retrieved outputs in the intro notebook (which we keep fully cached so users can interact with the DSPy tutorial without having to expend their API key credits).
If you could make changes to this PR that enable the existing search/retrieve functions as I mentioned earlier, while introducing the new behavior separately, I can merge it after that. lmk if this makes sense!
Thanks for the suggestion @arnavsinghvi11 I had a query regarding the joblib Memory cache. In the current implementation, we cache a list of text (only the relevant context). But in my implementation, I am returning a Prediction object. Can joblib cache any abstract data type, or just text? Also, for the local ColBERTv2: should I integrate the caching mechanism? It is a local model and does not require us to send requests to an API. Can you help me on the PR @arnavsinghvi11?
Hi @arnavsinghvi11 Can you help me with the caching functionality here for the PR? I am unable to understand the caching mechanism here.
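On the question of whether joblib can cache arbitrary return types: joblib's Memory persists function results by serializing them (pickle under the hood), so any picklable object works, not just text. A stdlib sketch of the same principle, using a hypothetical Prediction-like class rather than dspy's actual one:

```python
# joblib's Memory stores function results by serializing them to disk, so
# the real constraint is picklability, not "text only". This stdlib sketch
# round-trips a Prediction-like object through pickle to demonstrate that.
import pickle
from dataclasses import dataclass, field

@dataclass
class Prediction:  # hypothetical stand-in for dspy's Prediction
    pid: list = field(default_factory=list)
    rerank_score: list = field(default_factory=list)
    passages: list = field(default_factory=list)

pred = Prediction(pid=[6, 48], rerank_score=[15.8, 14.2],
                  passages=["The best things in life are free.",
                            "Patience is a virtue."])

restored = pickle.loads(pickle.dumps(pred))  # what a disk cache would do
```

So returning a Prediction object instead of a plain text list would not by itself prevent caching, as long as the object serializes cleanly.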
Hi @Athe-kunal ,
I would firstly recommend keeping the `search` and `retrieve` abstractions with as minimal differences as possible. I think I mentioned this in an earlier review, but the changes to `retrieve`, `retrieveRerankEnsemble`, `retrieveEnsemble`, and the `forward` function in `Retrieve` directly impact existing caches with the `intro.ipynb` notebook. Any changes to those make it difficult to merge PRs without having to rerun all the requests with the new changes (which we highly prefer not to do, to maintain consistency).
In simpler terms, the `ColBERTv2` pipeline with `search` and `retrieve` should work as-is, and you can test this with your PR changes by running the intro notebook and checking whether any of the changes cause it to fail to execute.
Let me know if this is clear, as it's not a caching problem to implement in this PR, but rather maintaining existing behavior so that existing caches keep working!
Some steps I suggest: revert back to the original behavior of `search` and `retrieve`, and then integrate `ColBERTv2RetrieverLocal` and `ColBERTv2RerankerLocal` to work with those untouched functions. If you follow the existing setup of `ColBERTv2` with its corresponding return types, I think this can work. Once that's done, feel free to ping me back on the PR so we can take a look and ensure no existing caches break.
From there, the next step would be to mirror the existing ColBERTv2 cache functions to align with the local mode being introduced.
Let me know if this all makes sense!
Hi @arnavsinghvi11 Thanks for your detailed description. I had one small query: the ColBERT retriever is a local model. Caching is essential if we are sending requests to a third-party API, but if it is a local model, does it require caching? If users expose the ColBERT model via a server, they can directly use the previous ColBERT retriever, which has caching. Please let me know if I am missing something.
> But if it is a local model, then does it require caching?

Not a direct answer, but HFClientTGI supports caching for models hosted locally (REST API). Maybe it would benefit here as well. But caching is not a requirement with this PR. Ensuring backward compatibility with the existing `dspy.ColBERTv2`, in addition to supporting local ColBERT, is!
Got it @arnavsinghvi11, working on it
I thought that users could first create a ColBERTv2 local index or reranker, then expose it via a server, and then use the existing ColBERTv2 functionality (which supports caching). However, exposing local models to caching can also be helpful. I provided a notebook for these ColBERTv2 retrievers and rerankers, and the `search` and `retrieve` functions were working properly.
Also, if you don't mind, could you share your Discord handle on the DSPy Discord channel? It would be helpful to interact there too.
Hi @arnavsinghvi11 I tested with the `intro.ipynb` file, and it worked without hiccups. Can you respond to the above requests?
Hi @Athe-kunal , I don't see any changes to the PR. Could you check if they are pushed? My username is bigman11, btw.
Hi @arnavsinghvi11 I have made the following changes
- Added separate functions for retrieving with metadata so that existing cache won't break. I am passing
with_metadata
parameter, which has a default value ofFalse
, thus the current tutorials would work fine with it - Also, I added a
by_prob
parameter with default value ofTrue
todspy.Retrieve
. It turns out thatdsp.retrieve
needs this by_prob parameter, but it is being passed as a kwargs parameter.
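The backward-compatible default described above can be sketched as follows. The function name and fields here are illustrative, not dspy's actual API: callers who omit `with_metadata` get the old plain-text return, while opting in returns passages with metadata.

```python
# Sketch of a backward-compatible flag: the default (with_metadata=False)
# preserves the old return type so existing callers and caches are
# unaffected; opting in returns passages together with their metadata.
# Names and fields are illustrative, not dspy's real API.

def retrieve(query, corpus, k=3, with_metadata=False):
    hits = corpus[:k]  # stand-in for real retrieval and scoring
    if not with_metadata:
        return [doc["text"] for doc in hits]  # legacy behavior, cache-safe
    return [{"text": doc["text"], "pid": doc["pid"]} for doc in hits]

corpus = [{"pid": 0, "text": "No pain, no gain."},
          {"pid": 1, "text": "Patience is a virtue."}]

legacy = retrieve("pain", corpus, k=2)                    # list of strings
rich = retrieve("pain", corpus, k=2, with_metadata=True)  # dicts with pids
```

Because the default path returns exactly what it did before, cached outputs keyed on the old return shape stay valid.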
This PR has become far too unwieldy; I will work on ColBERT local caching in another PR. Can you review this one @arnavsinghvi11?
Thanks @Athe-kunal for your patience with this PR! The changes no longer break the existing `intro.ipynb` caches and are good to merge. Did you want to add all these changes to the separate PR, or should I go ahead and squash+merge the changes?
Hi @arnavsinghvi11 Thanks for your support and guidance throughout the PR process, it was a great learning experience. You can squash and merge the changes for this PR. I will work on the caching for the new Colbert Local models this weekend, but for now you can merge this one
Thanks again for helping me in this PR.
Merged. Thanks again @Athe-kunal !