core icon indicating copy to clipboard operation
core copied to clipboard

Add metadata to `intermediate_summaries` in `get_summary_text` method

Open nicola-corbellini opened this issue 2 years ago • 5 comments

I think it could be useful to be able to define the metadata of

intermediate_summaries = [
        Document(page_content=summary) for summary in intermediate_summaries
         ]

at line 186 in cheshire_cat.py.

They could be both defined passing a dict as argument to the get_summary_function or inherited from the docs that are summarized. What do you think?

nicola-corbellini avatar Apr 21 '23 21:04 nicola-corbellini

Hi 🙌 What would you save there? We insert metadata when we store documents in memory, as you could see here. Does this answer your questions? maybe I didn't get the issue right 😸

calebgcc avatar Apr 22 '23 16:04 calebgcc

@calebgcc @nicola-corbellini if we do it right this can be really powerful. We want:

  • the cat being able to insert docs in memory with metadata (done)
  • doc+metadata should be passed through a hook so plugins can edit the doc and the metadata before the memory gets inserted.
  • same thing happening in retrieval

something like

@hook
def before_insertion_in_vector_memory(doc, cat):
    
    # do stuff with doc.page_content or doc.metadata

    return doc

We define good general defaults in core_plugin and let user plugins do vertical stuff Also in this case we may "pipe" hooks instead of running only the one with highest priority. Such a pipe would allow several plugins to edit metadata in an ordered, cumulative way.

@hook(priority=2, pipe=True)

What do you think? Let's discuss it!

pieroit avatar Apr 22 '23 17:04 pieroit

@calebgcc yep, sorry I could have explained a little further.

I was dealing with a Document not yet in memory and I wanted to use load_qa_with_sources_chain from Langhchain. However, summaries lose their source key .

Also when storing documents in memory, it could be useful to add custom metadata (in my case they would be Journal, URL, Authors of a paper). E.g. this is how I tried to extend the saving in the WorkingMemory draft I'm using:

    def remember(self, ccat, **kwargs):
        # check **kwargs
        for m, mem in enumerate(self.memories):
            _ = ccat.memory.vectors.declarative.add_texts(
                [mem.page_content],
                [
                    {
                        "when": time.time(),
                        "text": mem.page_content,
                        **mem.metadata,  # assuming 'source' is already in metadata
                        **kwargs
                    }
                ],
            )

if you find this can be useful.

Btw, if I have correctly understood what @pieroit suggested, in my case it could become:

@hook
def before_insertion_in_vector_memory(doc, cat):
    
    # process metadata 
    cat.working_memory.remember(cat, author=...,journal=...) 


    return doc

Some changes to remember would be needed then. Does this makes sense?

nicola-corbellini avatar Apr 23 '23 09:04 nicola-corbellini

something like

@hook
def before_insertion_in_vector_memory(doc, cat):
    
    # do stuff with doc.page_content or doc.metadata

    return doc

I'm thinking now, are the doc and the cat enough as parameters? @pieroit The original source name, if it is or is not a summary, from where it was called, those things could be useful to the developer of plugins

calebgcc avatar Apr 23 '23 16:04 calebgcc

Btw, if I have correctly understood what @pieroit suggested, in my case it could become:

@hook
def before_insertion_in_vector_memory(doc, cat):
   
 # process metadata 
   cat.working_memory.remember(cat, author=...,journal=...) 


   return doc

Sorry, I just wanted to correct myself, as the above wouldn't be the correct use neither in my case, nor in general.

nicola-corbellini avatar Apr 24 '23 07:04 nicola-corbellini

something like

@hook
def before_insertion_in_vector_memory(doc, cat):
    
    # do stuff with doc.page_content or doc.metadata

    return doc

I'm thinking now, are the doc and the cat enough as parameters? @pieroit The original source name, if it is or is not a summary, from where it was called, those things could be useful to the developer of plugins

Yes we should find a way to pass more context, maybe directly through the doc metadata.

In case of the rabbit_hole for example, metadata should already contain:

{
    "source": source,
    "is_summary": True
    "when": time.time(),
    "text": doc.page_content
}

As the hook is called just before memory insertion. What do you think? Additional args to the hook is no problem, but I don't know how and what to pass. Maybe it's cleaner to update doc metadata as docs are created and edited.

Since source contains the name of the original file, plugins can extract other info and add author, journal, etc as @nicola-corbellini is doing - before memory insertion. There should be also other hooks to access the whole file, textual content and raw documents. Plugins can store anything they want in a local variable and orchestrate this hooks with information they got form the cat during execution.

pieroit avatar Apr 27 '23 09:04 pieroit

@pieroit I agree on using document metadata to pass context inside the hook. Source, is_summary, when and text are a perfect start.

The only thing that could also be useful, but it's a bit difficult to do, is to give the context of who triggered this insertion in memory, was the cat through admin, or was it another tool? (could it be also useless to know? don't know)

calebgcc avatar Apr 27 '23 09:04 calebgcc

@pieroit I agree on using document metadata to pass context inside the hook. Source, is_summary, when and text are a perfect start.

Great! Let's use this issue for the addition

The only thing that could also be useful, but it's a bit difficult to do, is to give the context of who triggered this insertion in memory, was the cat through admin, or was it another tool? (could it be also useless to know? don't know)

That would be useful as well. Also we can track arbitrary information on what and why the cat is doing with a cat.working_memory object as proposed by @nicola-corbellini, I'll propose something more structured in #103 .

Thanks

pieroit avatar Apr 27 '23 13:04 pieroit

PR #186 added a hook to all documents that are inserted in vector_memory @pieroit and I discussed in the comments of that PR how to proceed with the other hooks concerning storing in memory. I think this issue is solved. What we need to do now is to decide how we want to implement hooks:

  • General There is only one hook that is called every time an object is stored in memory

  • Specific There is a hook specific for each object/situation which we store in memory (one for documents, one for chat history, etc...)

  • Redundant Both the General and Specific approaches. It could lead to confusion in users if it is not well explained in the docs, but it is the one that gives more control.

calebgcc avatar May 15 '23 14:05 calebgcc

Thanks @calebgcc

The redundant option seems the most powerful to me.

Let's also hear @nicola-corbellini @EugenioPetulla @Mte90 ... what do you think?

pieroit avatar May 15 '23 16:05 pieroit

I am for the redundant option too.

Let me know if I can be of any help.

nicola-corbellini avatar May 15 '23 18:05 nicola-corbellini

I am too for redundant but the general and specific need to be separated in 2 different filters, in this way based on the needs the dev can use what is the best one.

Mte90 avatar May 15 '23 19:05 Mte90