haystack `WhisperTranscriber` to add filename to document metadata

It would be great if we provided the option to add the filename to the metadata of the documents that the WhisperTranscriber creates. Currently there's no good way of doing this. This would really help when building RAG pipelines where you want to query videos and reference the source video in the response.

Sep 04 '23 22:09 TuanaCelik

Additional finding with @anakin87: even if we pass the meta through an indexing pipeline, as shown below, it gets ignored. I think this is because the root node (Whisper) ignores the meta.

The indexing pipeline:

whisper = WhisperTranscriber(api_key=api_key)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])

videos = ["https://www.youtube.com/watch?v=h5id4erwD4s", "https://www.youtube.com/watch?v=iFUeV3aYynI"]

# for video in videos:
file_path1 = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")
file_path2 = youtube2audio("https://www.youtube.com/watch?v=iFUeV3aYynI")
doc1 = {'file_path': file_path1, "url": "https://www.youtube.com/watch?v=h5id4erwD4s"}
doc2 = {'file_path': file_path2, "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"}

# One meta dict per file, in the same order as file_paths
indexing_pipeline.run(file_paths=[doc1['file_path'], doc2['file_path']], meta=[{"url": doc['url']} for doc in [doc1, doc2]])

Sep 05 '23 08:09 TuanaCelik

As Tuana said, meta is ignored.

See, for example, the run method: https://github.com/deepset-ai/haystack/blob/a5b815690ed7343882603a675c621ffc4c129c9b/haystack/nodes/audio/whisper_transcriber.py#L176-L186
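
Until the node handles meta itself, one possible workaround is to transcribe each file outside the pipeline and attach the metadata by hand. The snippet below is only a rough sketch of that idea, not the library's behavior: `transcripts_with_meta`, `fake_transcribe`, and the `audio/...` paths are hypothetical names made up for illustration, and the transcriber is passed in as a plain callable so the example runs without an API key. It produces dicts with `content` and `meta` keys, which 1.x document stores should accept in `write_documents`.

```python
from pathlib import Path
from typing import Callable, Dict, List


def transcripts_with_meta(
    items: List[Dict[str, str]],
    transcribe: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Transcribe each file and attach its filename and URL as metadata."""
    documents = []
    for item in items:
        text = transcribe(item["file_path"])
        documents.append(
            {
                "content": text,
                "meta": {
                    # Keep only the final path component as the filename
                    "file_name": Path(item["file_path"]).name,
                    "url": item["url"],
                },
            }
        )
    return documents


def fake_transcribe(file_path: str) -> str:
    # Stand-in for illustration only; in practice this would wrap a call
    # to the Whisper node (assumption, not shown here).
    return f"transcript of {file_path}"


docs = transcripts_with_meta(
    [
        {
            "file_path": "audio/h5id4erwD4s.mp3",  # hypothetical local path
            "url": "https://www.youtube.com/watch?v=h5id4erwD4s",
        },
        {
            "file_path": "audio/iFUeV3aYynI.mp3",  # hypothetical local path
            "url": "https://www.youtube.com/watch?v=iFUeV3aYynI",
        },
    ],
    fake_transcribe,
)
```

The resulting `docs` could then go through the preprocessor and embedder as usual, so each chunk keeps a reference to the video it came from.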

Sep 05 '23 09:09 anakin87

The issue was related to 1.x, which is in maintenance mode.

In 2.x, this information is added to Document.meta if available.

Oct 28 '24 16:10 anakin87