
`MetadataBuilder`

Open sjrl opened this issue 2 years ago • 7 comments

See the proposal: https://github.com/deepset-ai/haystack/pull/5540 and the feature request for Haystack v1.


LLM clients output strings, but many components expect other object types. LLMs can also produce output in a parsable format that can be converted directly into objects. Output parsers transform these strings into objects of the user’s choosing.

`MetadataBuilder` takes the string replies and inserts them as metadata into the Documents that were originally passed to the LLM. I'm open to renaming this one, since the goal is to output Documents with inserted metadata.

For example, a PromptNode could be used to summarize a longer doc, and the user would like the result inserted as metadata for that Document. This would allow us to easily add category tags, sentiment, summaries (...) to docs, which can then be used at query time (e.g. to filter down the search space efficiently or to feed the metadata into online retrieval/generation steps).
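
To make the intended I/O concrete, here is a rough sketch in plain Python. It is not the proposed implementation and does not use the Haystack API: the `Document` dataclass is a minimal stand-in, and `build_metadata` and `meta_key` are hypothetical names chosen for illustration. It assumes one reply per Document, in the same order.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Minimal stand-in for a Haystack Document (hypothetical, illustration only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def build_metadata(documents: List[Document], replies: List[str], meta_key: str) -> List[Document]:
    """Return copies of `documents` with each reply stored under `meta_key`.

    Assumes replies[i] was generated from documents[i].
    """
    if len(documents) != len(replies):
        raise ValueError("Expected exactly one reply per document.")
    enriched = []
    for doc, reply in zip(documents, replies):
        # Copy the existing metadata so the input Documents are left untouched.
        new_meta = {**doc.meta, meta_key: reply.strip()}
        enriched.append(Document(content=doc.content, meta=new_meta))
    return enriched


# Example: an LLM summary is attached to the Document it was generated from.
docs = [Document("A long report about solar panels ...", {"source": "report.pdf"})]
enriched = build_metadata(docs, ["  Solar panel efficiency overview. "], meta_key="summary")
```

Returning new Documents instead of mutating the inputs is a design choice of this sketch only; an actual component could just as well update the Documents in place.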

sjrl avatar Sep 01 '23 10:09 sjrl

More information on the expected use cases and component I/O can be found here.

In general, it is probably best to develop this component once Pipelines can handle looping and list inputs. (Otherwise, we would be building a component that is effectively unusable in Pipelines.)

anakin87 avatar Jan 15 '24 17:01 anakin87

@sjrl We are considering this issue for our next sprint. Is there any new info that will be relevant for the implementation of this component?

julian-risch avatar Jun 28 '24 10:06 julian-risch

This is probably relevant: https://www.notion.so/deepsetai/Advanced-Use-Case-Automatic-Metadata-Enrichment-8fdfc56e82434459963beaa7a9dc5069

anakin87 avatar Jun 28 '24 11:06 anakin87

Hey @julian-risch, thanks for reaching out! No new info on my end. I think the work @davidsbatista did that @anakin87 linked is exactly the type of use case we are thinking about: in general, metadata enrichment of files to help with retrieval through filters, embedding meta fields, etc. It could also serve downstream applications (e.g. they want to show a summary alongside a retrieved file). I'd be particularly interested in a setup that would allow me to automatically extract things like title, authors, publication date, etc. from PDF files and then save that as metadata with the file.
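
As a rough illustration of that last setup, the sketch below assumes the LLM was prompted to reply with a JSON object. The field names in `EXPECTED_FIELDS`, the helpers `parse_reply_to_meta` and `filter_by_meta`, and the dict-based file records are all hypothetical and not part of Haystack; the sample values are made up.

```python
import json
from typing import Any, Dict, List

# Hypothetical fields we asked the LLM to extract from each PDF.
EXPECTED_FIELDS = ("title", "authors", "publication_date")


def parse_reply_to_meta(reply: str) -> Dict[str, Any]:
    """Parse a JSON reply into metadata, keeping only the expected fields.

    Returns an empty dict if the reply is not a JSON object.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {key: data[key] for key in EXPECTED_FIELDS if key in data}


def filter_by_meta(files: List[Dict[str, Any]], key: str, value: Any) -> List[Dict[str, Any]]:
    """Keep only the files whose metadata has `key` equal to `value`."""
    return [f for f in files if f.get("meta", {}).get(key) == value]


# Made-up reply and file records, for illustration only.
reply = '{"title": "Example Paper", "authors": ["A. Author"], "publication_date": "2024-01-01", "extra": 1}'
files = [
    {"name": "paper.pdf", "meta": parse_reply_to_meta(reply)},
    {"name": "notes.pdf", "meta": parse_reply_to_meta("not valid json")},
]
matches = filter_by_meta(files, "publication_date", "2024-01-01")
```

Dropping unexpected keys and falling back to empty metadata on unparsable replies keeps a bad LLM response from polluting the stored metadata; whether the real component should instead raise or retry is an open design question.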

sjrl avatar Jun 28 '24 11:06 sjrl

see https://github.com/deepset-ai/haystack/issues/5700 - it's related/duplicated

davidsbatista avatar Jul 03 '24 10:07 davidsbatista

@davidsbatista Could you please check again if we can merge the two issues https://github.com/deepset-ai/haystack/issues/5700 and https://github.com/deepset-ai/haystack/issues/5702 or whether they should remain separate?

julian-risch avatar Sep 09 '24 06:09 julian-risch

After discussing it with Sebastian, these two issues should be merged.

davidsbatista avatar Sep 13 '24 07:09 davidsbatista