Small to big retrieval - part 1

Open f-lombardo opened this issue 1 year ago • 3 comments

Hi @MaximeThoonsen and @synio-wesley ,

this is a first step toward implementing Small to Big retrieval (https://github.com/theodo-group/LLPhant/issues/179).

Many things are still missing, first of all a new RetrievedDocumentsTransformer that removes duplicate documents, removes overlaps, and creates a union document from the retrieved chunks.
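To make the three steps concrete (drop duplicates, trim overlaps, build a union document), here is a minimal sketch. It is not the code from this PR: the chunk array shape and the function names `overlapLength` and `mergeRetrievedChunks` are hypothetical, and it assumes that `sourceType` plus `sourceName` are enough to group chunks of the same document.

```php
<?php

declare(strict_types=1);

// Hypothetical chunk shape used only in this sketch:
// ['sourceType' => string, 'sourceName' => string, 'chunkNumber' => int, 'content' => string]

/** Length of the longest suffix of $left that is also a prefix of $right. */
function overlapLength(string $left, string $right): int
{
    $max = min(strlen($left), strlen($right));
    for ($len = $max; $len > 0; $len--) {
        if (substr($left, -$len) === substr($right, 0, $len)) {
            return $len;
        }
    }

    return 0;
}

/**
 * @param array<int, array<string, mixed>> $chunks retrieved chunks, in any order
 * @return array<string, string> union text keyed by "sourceType:sourceName"
 */
function mergeRetrievedChunks(array $chunks): array
{
    // 1. Drop duplicates: the same source and chunk number is stored only once.
    $bySource = [];
    foreach ($chunks as $chunk) {
        // Assumption: sourceType + sourceName identify one document.
        $key = $chunk['sourceType'] . ':' . $chunk['sourceName'];
        $bySource[$key][$chunk['chunkNumber']] = $chunk['content'];
    }

    $merged = [];
    foreach ($bySource as $key => $contents) {
        // 2. Restore the original chunk order.
        ksort($contents);

        // 3. Concatenate, skipping text already present at the end of the union.
        $text = '';
        foreach ($contents as $content) {
            $text .= substr($content, overlapLength($text, $content));
        }
        $merged[$key] = $text;
    }

    return $merged;
}
```

The overlap check is a plain suffix/prefix comparison, so it only catches overlaps that are byte-for-byte identical at chunk boundaries. A splitter that normalizes whitespace differently between chunks would need a looser comparison.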

I'm not sure about the signature of DocumentStore::fetchDocumentsByChunkRange. I don't know if the hash of the document should also be added to the parameters besides $sourceType and $sourceName, since those two parameters are not guaranteed to identify a document uniquely: https://github.com/theodo-group/LLPhant/blob/25fc657580871f46a6bfc3c4a269cc93d1a65f3c/src/Embeddings/Document.php#L18

f-lombardo avatar Aug 05 '24 09:08 f-lombardo

When I'm back from holidays, I can compare this with what I have and give some feedback.

synio-wesley avatar Aug 05 '24 10:08 synio-wesley

Hey @synio-wesley and @MaximeThoonsen , would you like to talk about this PR?

f-lombardo avatar Sep 07 '24 21:09 f-lombardo

Sorry @f-lombardo, I got confused on this one, and I have a lot of work right now. Is it ready?

MaximeThoonsen avatar Oct 03 '24 17:10 MaximeThoonsen

@MaximeThoonsen I'm trying to create a more complete implementation, so I changed this PR to draft. If you have time (but you probably don't :-) ) you can have a look at how I'm evolving the solution. I'd also be interested in @synio-wesley's opinion.

f-lombardo avatar Oct 06 '24 21:10 f-lombardo

This is finally ready for a first review. Docs and more examples are still missing. @MaximeThoonsen @synio-wesley what's your opinion?

f-lombardo avatar Oct 09 '24 22:10 f-lombardo

@f-lombardo looks nice. You don't use a parent-child relation; instead you use a context window to also return what came before and after the matched embedding, right?

MaximeThoonsen avatar Oct 11 '24 13:10 MaximeThoonsen

> @f-lombardo looks nice. You don't use a parent-child relation; instead you use a context window to also return what came before and after the matched embedding, right?

Yes, this is the main idea of this PR, since IMHO it's easier to implement and to understand.
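As a rough illustration of the window idea: once a small chunk matches, the chunks just before and after it in the same source are fetched and returned together. The sketch below is not the PR's code. `InMemoryChunkStore` and `expandToWindow` are hypothetical, and the `fetchDocumentsByChunkRange` signature shown here is an assumption modelled on the method name mentioned earlier in this thread.

```php
<?php

declare(strict_types=1);

/** Hypothetical in-memory stand-in for a document store, for illustration only. */
final class InMemoryChunkStore
{
    /** @var array<string, array<int, string>> chunk contents grouped by source */
    private array $chunks = [];

    public function add(string $sourceType, string $sourceName, int $chunkNumber, string $content): void
    {
        $this->chunks[$sourceType . ':' . $sourceName][$chunkNumber] = $content;
    }

    /**
     * Assumed signature: returns the chunks of one source whose number lies in [$from, $to].
     *
     * @return array<int, string> contents keyed by chunk number, in ascending order
     */
    public function fetchDocumentsByChunkRange(string $sourceType, string $sourceName, int $from, int $to): array
    {
        $all = $this->chunks[$sourceType . ':' . $sourceName] ?? [];
        $result = [];
        // Clamp the lower bound so a window around chunk 0 does not go negative.
        for ($i = max(0, $from); $i <= $to; $i++) {
            if (isset($all[$i])) {
                $result[$i] = $all[$i];
            }
        }

        return $result;
    }
}

/**
 * Expands one matched chunk to its neighbours: $window chunks before and after.
 *
 * @return array<int, string>
 */
function expandToWindow(
    InMemoryChunkStore $store,
    string $sourceType,
    string $sourceName,
    int $matchedChunk,
    int $window
): array {
    return $store->fetchDocumentsByChunkRange(
        $sourceType,
        $sourceName,
        $matchedChunk - $window,
        $matchedChunk + $window
    );
}
```

Compared with a parent-child relation, this needs no extra link between small and large documents: the chunk number stored with each chunk is enough to find its neighbours.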

f-lombardo avatar Oct 11 '24 16:10 f-lombardo

Very cool @f-lombardo

MaximeThoonsen avatar Oct 11 '24 22:10 MaximeThoonsen