Extract structure from PDF and markdown files
Is your feature request related to a problem? Please describe.
We expect structural information to improve the quality of search results and to allow predicted answers to be mapped back to the original documents (e.g. PDF pages).
Plan
How can we add structural information, such as headlines, as metadata to Documents? Problem: File Converters return a single Document object containing the whole text. That Document is split only later, in the PreProcessor, which has no access to the original PDF/markdown files. One possibility: add a metadata field to the Document that contains the headlines of the whole file and the spans to which they apply. The PreProcessor would then need to adapt this metadata field when it splits the Document.
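To make the idea concrete, here is a minimal sketch in plain Python that does not use any Haystack classes. The function names (`extract_headlines`, `split_with_headlines`), the metadata layout (`headline`, `level`, `start_idx`), and the fixed-size character splitting are all assumptions for illustration, not the actual converter or PreProcessor API. A headline's span is taken to run from its `start_idx` up to the next headline.

```python
import re
from typing import Dict, List

# Markdown ATX headings only ("# Title" to "###### Title"); an assumption for this sketch.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)


def extract_headlines(text: str) -> List[Dict]:
    """Converter step (hypothetical): record each heading, its level and character offset."""
    return [
        {"headline": m.group(2).strip(), "level": len(m.group(1)), "start_idx": m.start()}
        for m in HEADING_RE.finditer(text)
    ]


def split_with_headlines(text: str, headlines: List[Dict], chunk_size: int) -> List[Dict]:
    """Splitting step (hypothetical): cut the text into chunks and adapt the headline metadata.

    Headlines that start inside a chunk get a start_idx relative to that chunk.
    The last headline before the chunk is carried over with start_idx=None,
    so the chunk still knows which section it belongs to.
    """
    chunks = []
    for begin in range(0, len(text), chunk_size):
        end = begin + chunk_size
        inside = [
            {**h, "start_idx": h["start_idx"] - begin}
            for h in headlines
            if begin <= h["start_idx"] < end
        ]
        before = [h for h in headlines if h["start_idx"] < begin]
        carried = [{**before[-1], "start_idx": None}] if before else []
        chunks.append({"content": text[begin:end], "meta": {"headlines": carried + inside}})
    return chunks


if __name__ == "__main__":
    text = "# Intro\nHello world.\n## Details\nMore text here.\n"
    for chunk in split_with_headlines(text, extract_headlines(text), chunk_size=20):
        print(repr(chunk["content"]), chunk["meta"]["headlines"])
```

Carrying the preceding headline into each chunk is one possible design choice: it keeps every split self-describing even when no heading starts inside it, at the cost of a `None` offset that consumers have to handle.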
- [ ] #3056
- [ ] #3057
~- [ ] #3058~
Let's remove #3058 from the scope of this epic; we'll tackle it separately.