langchain icon indicating copy to clipboard operation
langchain copied to clipboard

Base document loader to retriever

Open leo-gan opened this issue 2 years ago • 3 comments

made BaseLoader a retriever

Warning: this PR is highly controversial! Feel free to reject it without too much consideration. I've added methods of the BaseRetriever to the BaseLoader. That means all loaders, in principle, can be used as retrievers. I've tested it (see a Jupyter notebook example). Everything works as previously, but I have to add max_tokens_limit to this line qa = ConversationalRetrievalChain.from_llm(model,retriever=retriever, max_tokens_limit=4000) otherwise, it fails with the max token exception.

NOTE: Of course, lining fails in many checks on BaseLoader subclasses. I'll not fix them now because this PR will be rejected with a high probability or you give me better implementation ideas. :)) One of those ideas: Maybe using the Protocol class?

UPDATE: after refactoring, it is "not-a-braking-change" anymore. Linting is OK.

Who can review?

@hwchase17 @eyurtsev @dev2049

leo-gan avatar May 14 '23 03:05 leo-gan

thanks @leo-gan ! I'll review tomorrow! Controversial is great for discussion!

eyurtsev avatar May 15 '23 02:05 eyurtsev

@eyurtsev Hey Eugene, I've made it as a "not breaking change" :) Introduced an optional query parameter in BaseLoader. In this way, we don't need to add it to the load (and other methods). If we need a query in method, we can instead do it in two steps: 1) loader_instance.query = query 2) loader_instance.load() (see an example in BaseLoader.get_relevant_documents():L53-55)

leo-gan avatar May 17 '23 00:05 leo-gan

@eyurtsev Hey Eugene, I've made it as a "not breaking change" :) Introduced an optional query parameter in BaseLoader. In this way, we don't need to add it to the load (and other methods). If we need a query in method, we can instead do it in two steps: 1) loader_instance.query = query 2) loader_instance.load() (see an example in BaseLoader.get_relevant_documents():L53-55)

Hah! Yeah I get what you're doing -- the current loaders are causing some awkward downstream patterns when implementing the loader. I don't think we're ready yet to tackle a refactor of the base loader or retriever since there are larger changes that we'll probably want to plan for. I suggest we keep the loader as minimal as possible for now; with a sole responsibility of yield a fully partialled function (aka the load method) which can return a list of documents.

In the other PR I shared, I was trying to come up with some way of introducing another type into the hierarchy to enable compositional use, so we produce minimal changes on the BaseLoader. I don't think the pattern I came up with is good since it's introducing a lot of indirection and a lot of extra instantiation. (Perhaps it might have made more sense to create a loader from a retriever rather than a retriever from a loader.)

We can also keep the arxiv retriever that you had as is for now and collect a few more examples before trying to generalize.

eyurtsev avatar May 17 '23 16:05 eyurtsev

OK, Then I'm closing this PR.

leo-gan avatar May 17 '23 18:05 leo-gan