Support multimodal use cases
Haystack 2.0 documents support more than just text data, but multimodal use cases require tailored components. This epic is about enabling one of those use cases and getting an understanding of what blocks the others. The focus is on preprocessors, retrievers, and readers/generators for Table QA.
- Table QA: better preprocessing (e.g. for tables split across multiple PDF pages), retrievers, and readers/generators; a pipeline sketch follows this list
- Visual QA: answering questions about the content of images using both visual and textual information
- Image Captioning: indexing images and creating captions for better retrieval
- Speech-to-Text and Text-to-Speech (in addition to our WhisperTranscriber components)
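
To make the Table QA scope concrete, here is a minimal sketch of a query pipeline built only from existing Haystack 2.0 text components (`InMemoryBM25Retriever`, `PromptBuilder`, `OpenAIGenerator`). It assumes tables have already been extracted and serialized to Markdown in `Document.content` during preprocessing; the tailored components this epic asks for (table-aware preprocessors, retrievers, and readers/generators) would replace or feed the corresponding steps. This is an illustrative sketch, not the proposed implementation.

```python
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Tables serialized to Markdown at preprocessing time. A table-aware preprocessor
# (e.g. one that merges tables split across PDF pages) would produce these documents.
document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="| city | population |\n|---|---|\n| Berlin | 3,645,000 |\n| Paris | 2,161,000 |"),
    Document(content="| country | capital |\n|---|---|\n| Germany | Berlin |\n| France | Paris |"),
])

template = """Answer the question using only the tables below.

{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:"""

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("generator", OpenAIGenerator())  # requires OPENAI_API_KEY
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

question = "What is the population of Berlin?"
result = pipeline.run({
    "retriever": {"query": question},
    "prompt_builder": {"question": question},
})
print(result["generator"]["replies"][0])
```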
### Tasks
- [ ] https://github.com/deepset-ai/haystack-core-integrations/issues/485
- [ ] https://github.com/deepset-ai/haystack/issues/5943
Related issue about multimodal embedding: https://github.com/deepset-ai/haystack/issues/5943
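
As a reference point for the multimodal embedding discussion, here is a minimal sketch using plain `sentence-transformers` (not an existing Haystack component) that embeds images and a text query into a shared CLIP space and ranks the images by cosine similarity. The model name `clip-ViT-B-32` and the file names are illustrative assumptions, but this is roughly the capability an image embedder/retriever for the Visual QA and Image Captioning cases would need to expose.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into the same embedding space, so a text query
# can retrieve images directly without captions.
model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["dog.jpg", "invoice_table.png", "city_skyline.jpg"]  # hypothetical files
image_embeddings = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

query_embedding = model.encode("a photo of a dog", convert_to_tensor=True)

# Rank images by cosine similarity to the text query.
scores = util.cos_sim(query_embedding, image_embeddings)[0]
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```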