LLPhant Small to big retrieval

Hi @MaximeThoonsen and @synio-wesley,

this is a first idea toward implementing Small to Big retrieval (https://github.com/theodo-group/LLPhant/issues/179).

Many things are still missing. First of all, a new RetrievedDocumentsTransformer is needed to remove duplicated documents, remove overlaps, and create a union document from the retrieved chunks.

I'm not sure about the signature of DocumentStore::fetchDocumentsByChunkRange. I don't know if the hash of the document should also be added to the parameters besides $sourceType and $sourceName, since those two parameters are not guaranteed to identify a document uniquely: https://github.com/theodo-group/LLPhant/blob/25fc657580871f46a6bfc3c4a269cc93d1a65f3c/src/Embeddings/Document.php#L18
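To make the missing transformer step more concrete, here is a minimal sketch of what "remove duplicates, remove overlaps, create a union document" could look like. It is only an illustration under assumptions: `Chunk`, `mergeChunks` and `overlapLength` are hypothetical names, not part of LLPhant or of this PR, and the grouping key uses only source type and source name, which is exactly the identification that may not be unique.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a stored chunk; this is NOT LLPhant's Document class.
final class Chunk
{
    public function __construct(
        public string $sourceType,
        public string $sourceName,
        public int $chunkNumber,
        public string $content,
    ) {}
}

/**
 * Drops duplicated chunks, sorts the rest by chunk number and joins them
 * into one text per source, trimming text repeated at chunk boundaries.
 *
 * @param Chunk[] $chunks
 * @return array<string, string> merged text keyed by "sourceType:sourceName"
 */
function mergeChunks(array $chunks): array
{
    $bySource = [];
    foreach ($chunks as $chunk) {
        // Assumption: type + name identify a source; a hash could be added here.
        $key = $chunk->sourceType . ':' . $chunk->sourceName;
        // Same source and same chunk number means the same chunk: keep one.
        $bySource[$key][$chunk->chunkNumber] = $chunk->content;
    }

    $merged = [];
    foreach ($bySource as $key => $contents) {
        ksort($contents); // restore the original reading order
        $text = '';
        foreach ($contents as $content) {
            // Append only the part of the chunk not already present at the end.
            $text .= substr($content, overlapLength($text, $content));
        }
        $merged[$key] = $text;
    }

    return $merged;
}

/** Length of the longest suffix of $left that is also a prefix of $right. */
function overlapLength(string $left, string $right): int
{
    $max = min(strlen($left), strlen($right));
    for ($len = $max; $len > 0; $len--) {
        if (substr($left, -$len) === substr($right, 0, $len)) {
            return $len;
        }
    }

    return 0;
}
```

The sketch compares bytes, so multibyte text would need the `mb_*` functions, and it joins non-adjacent chunks without any separator. A real transformer would probably want to handle both cases.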

Aug 05 '24 09:08 f-lombardo

When I'm back from holidays I can compare this with what I have to give some feedback

Aug 05 '24 10:08 synio-wesley

Hey @synio-wesley and @MaximeThoonsen , would you like to talk about this PR?

Sep 07 '24 21:09 f-lombardo

Sorry @f-lombardo, I got confused on this one, plus I have a lot of work right now. Is it ready?

Oct 03 '24 17:10 MaximeThoonsen

@MaximeThoonsen I'm trying to create a more complete implementation, so I changed this PR to draft. If you have time (but you probably don't :-) ) you can have a look at how I'm evolving the solution. I'd also be interested in @synio-wesley's opinion.

Oct 06 '24 21:10 f-lombardo

This is finally ready for a first review. Docs and more examples are still missing. @MaximeThoonsen @synio-wesley what's your opinion?

Oct 09 '24 22:10 f-lombardo

@f-lombardo looks nice. You don't use a parent-child relation, but you use a context window to also return what was before and after the matched embedding, right?

Oct 11 '24 13:10 MaximeThoonsen

> @f-lombardo looks nice. You don't use a parent-child relation, but you use a context window to also return what was before and after the matched embedding, right?

Yes, this is the main idea of this PR, since IMHO it's easier to implement and to understand.
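As a rough illustration of the window idea, the sketch below expands a matched chunk to its neighbours by chunk number. It is not the code of this PR: `chunksInRange` and `expandMatch` are hypothetical helpers working on a plain in-memory array, and the array keys are assumptions chosen for the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical in-memory lookup: returns the chunks of one source whose
 * chunk number lies in [$first, $last], ordered by chunk number.
 *
 * @param list<array{sourceType: string, sourceName: string, chunkNumber: int, content: string}> $store
 */
function chunksInRange(array $store, string $sourceType, string $sourceName, int $first, int $last): array
{
    $found = array_filter(
        $store,
        fn (array $c): bool => $c['sourceType'] === $sourceType
            && $c['sourceName'] === $sourceName
            && $c['chunkNumber'] >= $first
            && $c['chunkNumber'] <= $last
    );
    // usort also reindexes the filtered array from 0.
    usort($found, fn (array $a, array $b): int => $a['chunkNumber'] <=> $b['chunkNumber']);

    return $found;
}

/** Expands a matched chunk to the chunks within $window positions around it. */
function expandMatch(array $store, array $match, int $window): array
{
    $first = max(0, $match['chunkNumber'] - $window); // no chunk before number 0
    $last = $match['chunkNumber'] + $window;

    return chunksInRange($store, $match['sourceType'], $match['sourceName'], $first, $last);
}
```

With a window of 1, a match on chunk 2 of a source would return chunks 1, 2 and 3 of that same source, so the small chunk is used for matching and the larger surrounding text is what gets returned.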

Oct 11 '24 16:10 f-lombardo

Very cool @f-lombardo

Oct 11 '24 22:10 MaximeThoonsen

Small to big retrieval - part 1