Optional `FileConverter` parameter to add name to `meta`
Often we need to add file names to the meta field. Some nodes, like the crawler, do this by default for URL names. Could the converters have an optional parameter, add_name_to_meta = True or False, to do this automatically? Right now the next best option is to add the names in a pipeline.run() or convert() call, with a for loop over each individual file.
@ZanSara If it's already possible, we should improve documentation so that it's easier to understand.
> Now the next best option is to add this in a pipeline.run() or convert() call and for loop for each individual file.
Actually this is not correct. Currently the best option is to use
pipeline.run(file_paths=files, meta=[{"name": file.name} for file in files])
(pseudocode, but you get the idea). That's what all examples and tutorials are doing. For example, in examples/basic_qa_pipeline.py:
https://github.com/deepset-ai/haystack/blob/fd25106c883bba36a4f5276792f024d4622130b3/examples/basic_qa_pipeline.py#L25-L26
https://github.com/deepset-ai/haystack/blob/fd25106c883bba36a4f5276792f024d4622130b3/examples/basic_qa_pipeline.py#L53
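To make the pseudocode above concrete, here is a minimal sketch of building the `meta` list from a list of paths. The helper name `build_name_meta` and the example paths are hypothetical, and the final call is left as a comment because it assumes an already-built Haystack 1.x indexing pipeline.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Union


def build_name_meta(file_paths: Iterable[Union[str, Path]]) -> List[Dict[str, str]]:
    """Return one {"name": <file name>} dict per path, in the same order."""
    # Path(...).name keeps only the final path component, e.g. "report.pdf".
    return [{"name": Path(p).name} for p in file_paths]


if __name__ == "__main__":
    # Hypothetical input files; replace with your own paths.
    files = [Path("data/report.pdf"), Path("data/notes.txt")]
    meta = build_name_meta(files)
    print(meta)

    # Assumption: `indexing_pipeline` is a Haystack 1.x Pipeline that starts
    # with a file converter. The i-th meta dict is paired with the i-th file.
    # indexing_pipeline.run(file_paths=files, meta=meta)
```

The two lists have to stay the same length and in the same order, which is why the helper derives `meta` directly from the paths rather than building it separately.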
If you're still convinced we need a more concise way than this to add the names to the files, we'll keep the issue open.
The idea of having metadata that references the original files in converted/split Documents comes up very often in discussions with the community, so we can probably do a better job teaching this.
How/Where to teach this?
- Preprocessing Tutorial: seems like the right place to teach this, but the problem is that we are still using the magic `convert_files_to_docs` instead of a Pipeline, and this function does not allow passing custom `meta`. It might be worth changing the tutorial...
- FAQ: there is something about this topic. Maybe it can be expanded, even if I don't know how much this page is used by people.
(FYI @dfokina @bilgeyucel )
For Haystack 1.x, there is a workaround: pipeline.run(file_paths=files, meta=[{"name": file.name} for file in files]), which assigns the file name as metadata to documents at indexing time. That should be enough.
In Haystack 2.0 this is not an issue. Indexed documents have the filename in their metadata.