Small to big retrieval - part 1
Hi @MaximeThoonsen and @synio-wesley ,
this is a first idea toward implementing Small to Big retrieval (https://github.com/theodo-group/LLPhant/issues/179).
Many things are still missing, first of all a new RetrievedDocumentsTransformer to remove duplicated documents, remove overlaps, and create a union document from the retrieved chunks.
I'm not sure about the signature of DocumentStore::fetchDocumentsByChunkRange. I don't know whether the hash of the document should be added to the parameters alongside $sourceType and $sourceName, since those two parameters are not guaranteed to identify a document uniquely:
https://github.com/theodo-group/LLPhant/blob/25fc657580871f46a6bfc3c4a269cc93d1a65f3c/src/Embeddings/Document.php#L18
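To make the missing transformer step more concrete, here is a rough sketch of what I have in mind. It is written in Python only for brevity and is not the PR's PHP code. The `Chunk` class, its field names, and `merge_retrieved_chunks` are hypothetical, and it assumes that the document hash is part of the identifying key and that chunk numbers are consecutive integers within a document.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Chunk:
    # Hypothetical shape of a retrieved chunk; field names are assumptions.
    source_type: str
    source_name: str
    doc_hash: str      # assumed: hash of the whole source document
    chunk_number: int
    content: str


def merge_retrieved_chunks(chunks: list[Chunk]) -> list[str]:
    """Drop duplicate chunks and join runs of consecutive chunks per document."""
    # Overlapping retrieval windows return the same chunk more than once,
    # so deduplicate on the full identifying key first.
    unique = {
        (c.source_type, c.source_name, c.doc_hash, c.chunk_number): c
        for c in chunks
    }
    ordered = [unique[key] for key in sorted(unique)]

    merged: list[str] = []
    by_document = lambda c: (c.source_type, c.source_name, c.doc_hash)
    for _, group in groupby(ordered, key=by_document):
        run: list[Chunk] = []
        for chunk in group:
            # A gap in chunk numbers closes the current union document.
            if run and chunk.chunk_number != run[-1].chunk_number + 1:
                merged.append("".join(c.content for c in run))
                run = []
            run.append(chunk)
        merged.append("".join(c.content for c in run))
    return merged


retrieved = [
    Chunk("files", "doc.txt", "h1", 2, "B"),
    Chunk("files", "doc.txt", "h1", 1, "A"),
    Chunk("files", "doc.txt", "h1", 2, "B"),   # duplicate from an overlapping window
    Chunk("files", "doc.txt", "h1", 4, "D"),   # not adjacent to chunk 2
    Chunk("files", "doc.txt", "h2", 1, "X"),   # same name, different document hash
]
result = merge_retrieved_chunks(retrieved)
```

In this sketch the two chunks that share $sourceType and $sourceName but differ in hash end up in separate union documents, which is the case that makes me doubt the current signature.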
When I'm back from holidays, I can compare this with what I have and give some feedback.
Hey @synio-wesley and @MaximeThoonsen , would you like to talk about this PR?
Sorry @f-lombardo, I got confused on this one, and I have a lot of work right now. Is it ready?
@MaximeThoonsen I'm trying to create a more complete implementation, so I changed this PR to draft. If you have time (but you probably don't :-) ) you can have a look at how I'm evolving the solution. I'd also be interested in @synio-wesley's opinion.
This is finally ready for a first review. Docs and more examples are still missing. @MaximeThoonsen @synio-wesley what's your opinion?
@f-lombardo looks nice. You don't use a parent-child relation, but you use a context window to also return what was before and after the matched embedding, right?
Yes, this is the main idea of this PR, since IMHO it's easier to implement and to understand.
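Roughly, the window idea can be sketched like this. It is Python used only for brevity, not the PR's PHP code. The function names, the in-memory `store` dictionary, and the assumption that chunk numbers start at 0 are all hypothetical, and the real DocumentStore::fetchDocumentsByChunkRange signature may differ.

```python
def window_range(matched: int, window: int, last_chunk: int) -> tuple[int, int]:
    """Inclusive chunk range around a match, clamped to [0, last_chunk]."""
    return max(0, matched - window), min(last_chunk, matched + window)


def fetch_by_chunk_range(
    store: dict[tuple[str, str, int], str],
    source_type: str,
    source_name: str,
    left: int,
    right: int,
) -> list[str]:
    """Return the stored chunks numbered left..right (inclusive), in order."""
    return [
        store[(source_type, source_name, n)]
        for n in range(left, right + 1)
        if (source_type, source_name, n) in store
    ]


# Hypothetical store: one document split into five small chunks.
store = {("files", "doc.txt", n): f"c{n}" for n in range(5)}

# The embedding search matched chunk 1; widen it by two chunks on each side.
left, right = window_range(matched=1, window=2, last_chunk=4)
context = fetch_by_chunk_range(store, "files", "doc.txt", left, right)
```

The small chunk is what gets matched by the embedding search, and the surrounding range is what is handed to the LLM, so no explicit parent document needs to be stored.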
Very cool @f-lombardo