Support multimodal use cases
Haystack 2.0 documents support more than just text data, but multimodal use cases require tailored components. This epic is about enabling one of those use cases and getting an understanding of what blocks the others. The focus is on preprocessors, retrievers, and readers/generators for Table QA.
- Table QA: better preprocessing (e.g. for tables split across multiple PDF pages), retrievers, and readers/generators; a pipeline sketch follows this list
- Visual QA: answering questions about the content of images using both visual and textual information
- Image Captioning: indexing images and creating captions for better retrieval
- Speech-to-Text and Text-to-Speech (in addition to our WhisperTranscriber components)
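
To make the Table QA scope concrete, here is a minimal sketch of a query pipeline built only from existing Haystack 2.0 text components (`InMemoryBM25Retriever`, `PromptBuilder`, `OpenAIGenerator`). It assumes tables have already been extracted and serialized to Markdown in `Document.content` during preprocessing; the tailored components this epic asks for (table-aware preprocessors, retrievers, and readers/generators) would replace or feed the corresponding steps. This is an illustrative sketch, not the proposed implementation.

```python
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Tables serialized to Markdown at preprocessing time. A table-aware preprocessor
# (e.g. one that merges tables split across PDF pages) would produce these documents.
document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="| city | population |\n|---|---|\n| Berlin | 3,645,000 |\n| Paris | 2,161,000 |"),
    Document(content="| country | capital |\n|---|---|\n| Germany | Berlin |\n| France | Paris |"),
])

template = """Answer the question using only the tables below.

{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:"""

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("generator", OpenAIGenerator())  # requires OPENAI_API_KEY
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

question = "What is the population of Berlin?"
result = pipeline.run({
    "retriever": {"query": question},
    "prompt_builder": {"question": question},
})
print(result["generator"]["replies"][0])
```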
### Tasks
- [ ] https://github.com/deepset-ai/haystack-core-integrations/issues/485
- [ ] https://github.com/deepset-ai/haystack/issues/5943
Related issue about multimodal embedding: https://github.com/deepset-ai/haystack/issues/5943
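
As a reference point for the multimodal embedding discussion, here is a minimal sketch using plain `sentence-transformers` (not an existing Haystack component) that embeds images and a text query into a shared CLIP space and ranks the images by cosine similarity. The model name `clip-ViT-B-32` and the file names are illustrative assumptions, but this is roughly the capability an image embedder/retriever for the Visual QA and Image Captioning cases would need to expose.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into the same embedding space, so a text query
# can retrieve images directly without captions.
model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["dog.jpg", "invoice_table.png", "city_skyline.jpg"]  # hypothetical files
image_embeddings = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

query_embedding = model.encode("a photo of a dog", convert_to_tensor=True)

# Rank images by cosine similarity to the text query.
scores = util.cos_sim(query_embedding, image_embeddings)[0]
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```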