Optional `FileConverter` parameter to add name to `meta`

Open TuanaCelik opened this issue 2 years ago • 2 comments

Often we need to add file names to the meta field. Some nodes, like the crawler, do this by default for the URL names. Could it be an idea to add an optional parameter, add_name_to_meta = True or False, to the converters to do this automatically? Now the next best option is to add this in a pipeline.run() or convert() call and for loop for each individual file.

TuanaCelik avatar Jun 28 '23 10:06 TuanaCelik

@ZanSara If it's already possible, we should improve documentation so that it's easier to understand.

julian-risch avatar Jul 05 '23 13:07 julian-risch

> Now the next best option is to add this in a pipeline.run() or convert() call and for loop for each individual file.

Actually this is not correct. Currently the best option is to use

pipeline.run(file_paths=files, meta=[{"name": file.name} for file in files])

(pseudocode, but you get the idea). That's what all examples and tutorials are doing. For example, in examples/basic_qa_pipeline.py:

https://github.com/deepset-ai/haystack/blob/fd25106c883bba36a4f5276792f024d4622130b3/examples/basic_qa_pipeline.py#L25-L26

https://github.com/deepset-ai/haystack/blob/fd25106c883bba36a4f5276792f024d4622130b3/examples/basic_qa_pipeline.py#L53
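To make the pattern above concrete, here is a minimal sketch of building the `meta` list. The helper name `build_name_meta` and the example paths are hypothetical, made up for illustration. The final call is left as a comment because it assumes a Haystack 1.x indexing pipeline (here called `indexing_pipeline`) that has already been built elsewhere.

```python
from pathlib import Path


def build_name_meta(file_paths):
    """Return one {"name": <file name>} dict per path, in the same order.

    Hypothetical helper: it only prepares the list that is passed as `meta`.
    """
    return [{"name": Path(p).name} for p in file_paths]


# Example paths (made up); they do not need to exist to build the meta list.
file_paths = [Path("data/report.txt"), Path("data/notes.md")]
meta = build_name_meta(file_paths)
print(meta)  # [{'name': 'report.txt'}, {'name': 'notes.md'}]

# Assumption: `indexing_pipeline` is an existing Haystack 1.x indexing pipeline.
# indexing_pipeline.run(file_paths=file_paths, meta=meta)
```

The list has to contain one dict per file, in the same order as `file_paths`, so that each name ends up on the Documents produced from the matching file.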

If you're still convinced we need a more concise way than this to add the names to the files, we'll keep the issue open.

ZanSara avatar Jul 06 '23 09:07 ZanSara

The idea of having metadata referencing the original files in converted/split Documents is very common in discussions with the community, so we can probably do a better job teaching this.

How/Where to teach this?

  • Preprocessing Tutorial: seems like the right place to teach this, but the problem is that we are still using the magic convert_files_to_docs instead of a Pipeline, and this function does not allow passing custom meta. It might be worth changing the tutorial...
  • FAQ: there is something about this topic. Maybe it can be expanded, although I don't know how much people use this page.

(FYI @dfokina @bilgeyucel )

anakin87 avatar Jul 18 '23 08:07 anakin87

For Haystack 1.x, there is a workaround: assign the filename as metadata to documents at indexing time with pipeline.run(file_paths=files, meta=[{"name": file.name} for file in files]). That should be enough. In Haystack 2.0 this is not an issue: indexed documents have the filename in their metadata.

julian-risch avatar Apr 08 '24 13:04 julian-risch