core
core copied to clipboard
Add metadata to `intermediate_summaries` in `get_summary_text` method
I think it could be useful to be able to define the metadata of
intermediate_summaries = [
Document(page_content=summary) for summary in intermediate_summaries
]
at line 186 in cheshire_cat.py.
They could be both defined passing a dict as argument to the get_summary_function or inherited from the docs that are summarized. What do you think?
Hi 🙌 What would you save there? We insert metadata when we store documents in memory, as you could see here. Does this answer your questions? maybe I didn't get the issue right 😸
@calebgcc @nicola-corbellini if we do it right this can be really powerful. We want:
- the cat being able to insert docs in memory with metadata (done)
- doc+metadata should be passed through a hook so plugins can edit the doc and the metadata before the memory gets inserted.
- same thing happening in retrieval
something like
@hook
def before_insertion_in_vector_memory(doc, cat):
# do stuff with doc.page_content or doc.metadata
return doc
We define good general defaults in core_plugin and let user plugins do vertical stuff
Also in this case we may "pipe" hooks instead of running only the one with highest priority. Such a pipe would allow several plugins to edit metadata in an ordered, cumulative way.
@hook(priority=2, pipe=True)
What do you think? Let's discuss it!
@calebgcc yep, sorry I could have explained a little further.
I was dealing with a Document not yet in memory and I wanted to use load_qa_with_sources_chain from Langhchain.
However, summaries lose their source key .
Also when storing documents in memory, it could be useful to add custom metadata (in my case they would be Journal, URL, Authors of a paper).
E.g. this is how I tried to extend the saving in the WorkingMemory draft I'm using:
def remember(self, ccat, **kwargs):
# check **kwargs
for m, mem in enumerate(self.memories):
_ = ccat.memory.vectors.declarative.add_texts(
[mem.page_content],
[
{
"when": time.time(),
"text": mem.page_content,
**mem.metadata, # assuming 'source' is already in metadata
**kwargs
}
],
)
if you find this can be useful.
Btw, if I have correctly understood what @pieroit suggested, in my case it could become:
@hook
def before_insertion_in_vector_memory(doc, cat):
# process metadata
cat.working_memory.remember(cat, author=...,journal=...)
return doc
Some changes to remember would be needed then.
Does this makes sense?
something like
@hook def before_insertion_in_vector_memory(doc, cat): # do stuff with doc.page_content or doc.metadata return doc
I'm thinking now, are the doc and the cat enough as parameters? @pieroit The original source name, if it is or is not a summary, from where it was called, those things could be useful to the developer of plugins
Btw, if I have correctly understood what @pieroit suggested, in my case it could become:
@hook def before_insertion_in_vector_memory(doc, cat): # process metadata cat.working_memory.remember(cat, author=...,journal=...) return doc
Sorry, I just wanted to correct myself, as the above wouldn't be the correct use neither in my case, nor in general.
something like
@hook def before_insertion_in_vector_memory(doc, cat): # do stuff with doc.page_content or doc.metadata return docI'm thinking now, are the doc and the cat enough as parameters? @pieroit The original source name, if it is or is not a summary, from where it was called, those things could be useful to the developer of plugins
Yes we should find a way to pass more context, maybe directly through the doc metadata.
In case of the rabbit_hole for example, metadata should already contain:
{
"source": source,
"is_summary": True
"when": time.time(),
"text": doc.page_content
}
As the hook is called just before memory insertion. What do you think? Additional args to the hook is no problem, but I don't know how and what to pass. Maybe it's cleaner to update doc metadata as docs are created and edited.
Since source contains the name of the original file, plugins can extract other info and add author, journal, etc as @nicola-corbellini is doing - before memory insertion. There should be also other hooks to access the whole file, textual content and raw documents. Plugins can store anything they want in a local variable and orchestrate this hooks with information they got form the cat during execution.
@pieroit I agree on using document metadata to pass context inside the hook. Source, is_summary, when and text are a perfect start.
The only thing that could also be useful, but it's a bit difficult to do, is to give the context of who triggered this insertion in memory, was the cat through admin, or was it another tool? (could it be also useless to know? don't know)
@pieroit I agree on using document metadata to pass context inside the hook. Source, is_summary, when and text are a perfect start.
Great! Let's use this issue for the addition
The only thing that could also be useful, but it's a bit difficult to do, is to give the context of who triggered this insertion in memory, was the cat through admin, or was it another tool? (could it be also useless to know? don't know)
That would be useful as well. Also we can track arbitrary information on what and why the cat is doing with a cat.working_memory object as proposed by @nicola-corbellini, I'll propose something more structured in #103 .
Thanks
PR #186 added a hook to all documents that are inserted in vector_memory @pieroit and I discussed in the comments of that PR how to proceed with the other hooks concerning storing in memory. I think this issue is solved. What we need to do now is to decide how we want to implement hooks:
-
General There is only one hook that is called every time an object is stored in memory
-
Specific There is a hook specific for each object/situation which we store in memory (one for documents, one for chat history, etc...)
-
Redundant Both the General and Specific approaches. It could lead to confusion in users if it is not well explained in the docs, but it is the one that gives more control.
Thanks @calebgcc
The redundant option seems the most powerful to me.
Let's also hear @nicola-corbellini @EugenioPetulla @Mte90 ... what do you think?
I am for the redundant option too.
Let me know if I can be of any help.
I am too for redundant but the general and specific need to be separated in 2 different filters, in this way based on the needs the dev can use what is the best one.