`WhisperTranscriber` to add filename to document metadata
It would be great if we provided the option to add the filename to the metadata of the documents that the `WhisperTranscriber` creates. Currently there's no good way of doing this. This would really help when building RAG pipelines where you want to query videos and also reference the source video in the response.
Additional learning with @anakin87 :
It seems that even if we want to add the meta via an indexing pipeline, as shown below, the meta will get ignored. I think this might be because the root node (Whisper) ignores the meta.
The indexing pipeline:
whisper = WhisperTranscriber(api_key=api_key)
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])
videos = ["https://www.youtube.com/watch?v=h5id4erwD4s", "https://www.youtube.com/watch?v=iFUeV3aYynI"]
# for video in videos:
file_path1 = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")
file_path2 = youtube2audio("https://www.youtube.com/watch?v=iFUeV3aYynI")
doc1 = {'file_path': file_path1, "url": "https://www.youtube.com/watch?v=h5id4erwD4s"}
doc2 = {'file_path': file_path2, "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"}
indexing_pipeline.run(file_paths=[doc1['file_path'], doc2['file_path']], meta=[{"url": doc['url']} for doc in [doc1, doc2]])
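One thing to watch when building `meta` for a call like this: where the closing brace sits decides whether you get one dict per file or a single dict for everything. A small plain-Python illustration (the `docs` list here is made up for the example and has nothing to do with Haystack itself):

```python
# Hypothetical inputs, only to show the shape of the resulting list.
docs = [
    {"file_path": "a.mp3", "url": "https://www.youtube.com/watch?v=h5id4erwD4s"},
    {"file_path": "b.mp3", "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"},
]

# Dict comprehension inside a list: one dict, and the last url overwrites the first.
single = [{"url": d["url"] for d in docs}]

# List comprehension of dicts: one meta dict per file, in the same order as the files.
per_file = [{"url": d["url"]} for d in docs]

print(len(single), len(per_file))
```

This is independent of the problem reported here: even with one dict per file, the 1.x node does not pass the meta through.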
As Tuana said, meta is ignored.
See, for example, the run method:
https://github.com/deepset-ai/haystack/blob/a5b815690ed7343882603a675c621ffc4c129c9b/haystack/nodes/audio/whisper_transcriber.py#L176-L186
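As a rough sketch of the behaviour being asked for, here is what merging the file name and user-supplied meta into each document could look like. This is not the Haystack implementation: `Doc` and `attach_file_meta` are hypothetical stand-ins written in plain Python, the `file_name` and `file_path` keys are my own choice, and it assumes exactly one document per input file, in the same order.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a transcribed document (not the Haystack class)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def attach_file_meta(
    docs: List[Doc],
    file_paths: List[str],
    meta: Optional[List[Dict[str, Any]]] = None,
) -> List[Doc]:
    """Add file name/path and optional per-file meta to each document.

    Assumes docs[i] was produced from file_paths[i].
    """
    if len(docs) != len(file_paths):
        raise ValueError("expected one document per file path")
    if meta is not None and len(meta) != len(file_paths):
        raise ValueError("expected one meta dict per file path")

    for i, (doc, path) in enumerate(zip(docs, file_paths)):
        doc.meta["file_name"] = Path(path).name
        doc.meta["file_path"] = str(path)
        if meta is not None:
            # User-supplied keys win over the defaults set above.
            doc.meta.update(meta[i])
    return docs


# Example with made-up transcripts and local file names.
docs = attach_file_meta(
    [Doc("first transcript"), Doc("second transcript")],
    ["audio/h5id4erwD4s.mp3", "audio/iFUeV3aYynI.mp3"],
    meta=[
        {"url": "https://www.youtube.com/watch?v=h5id4erwD4s"},
        {"url": "https://www.youtube.com/watch?v=iFUeV3aYynI"},
    ],
)
```

With something like this applied to the transcriber output, each document would carry both the file name and the `url`, which is what a RAG answer needs in order to point back at the source video.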
The issue was related to 1.x, which is in maintenance mode.
In 2.x, this information is added to Document.meta if available.