haystack icon indicating copy to clipboard operation
haystack copied to clipboard

`WhisperTranscriber` to add filename to document metadata

Open TuanaCelik opened this issue 2 years ago • 2 comments

It would be great if we provided the option to add the filename to the metadata of the documents that the WhisperTranscribercreates. Currently there's not good way of doing this. This would really help when building RAG pipelines where you want to query videos, but you want to reference the video in the response.

TuanaCelik avatar Sep 04 '23 22:09 TuanaCelik

Additional learning with @anakin87 : It seems that even if we want to add the meta via an indexing pipeline, as shown below, the meta will get ignored. I think this might be because the root node (Whisper) ignores the meta.

The indexing pipeline:

whisper = WhisperTranscriber(api_key=api_key)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])
videos = ["https://www.youtube.com/watch?v=h5id4erwD4s", "https://www.youtube.com/watch?v=iFUeV3aYynI"]

# for video in videos:
file_path1 = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")
file_path2 = youtube2audio("https://www.youtube.com/watch?v=iFUeV3aYynI")
doc1 = {'file_path': file_path1, "url": "https://www.youtube.com/watch?v=h5id4erwD4s"}
doc2 = {'file_path': file_path2, "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"}

indexing_pipeline.run(file_paths=[doc1['file_path'], doc2['file_path']], meta=[{"url": doc['url'] for doc in [doc1, doc2]}])

TuanaCelik avatar Sep 05 '23 08:09 TuanaCelik

As Tuana said, meta is ignored.

See, for example, the run method: https://github.com/deepset-ai/haystack/blob/a5b815690ed7343882603a675c621ffc4c129c9b/haystack/nodes/audio/whisper_transcriber.py#L176-L186

anakin87 avatar Sep 05 '23 09:09 anakin87

The issue was related to 1.x, which is in maintenance mode.

In 2.x, this information is added to Document.meta if available.

anakin87 avatar Oct 28 '24 16:10 anakin87