Extract structure from PDF and markdown files

Open masci opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? Please describe.

We expect structural information to improve the quality of search results and to allow predicted answers to be mapped back to the original documents (e.g. PDF pages).

Plan

How can we add structural information, such as headlines, to Documents as metadata?

Problem: File converters return a single Document object containing the whole text. This Document is only split later by the PreProcessor, which has no access to the original PDF/markdown files.

One possibility: Add a metadata field to the Document that holds the headlines of the whole file together with the spans to which they apply. The PreProcessor then needs to adapt this metadata field when it splits the Document.
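To make the idea concrete, here is a minimal, library-independent sketch of what such a metadata field could look like and how the spans might be re-based when the text is split. It is an illustration only, not the actual converter or PreProcessor code: `SimpleDocument`, `convert_markdown`, `split_document`, and the `"headlines"` metadata layout are hypothetical names chosen for this example. It also assumes markdown `#` headings, fixed-length character splits, and a flat structure where each headline applies until the next one starts.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stand-in for a Document; not a real library class.
@dataclass
class SimpleDocument:
    content: str
    meta: dict = field(default_factory=dict)

# Matches ATX-style markdown headings such as "## Details".
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)

def convert_markdown(text: str) -> SimpleDocument:
    """Return one document whose metadata lists every headline and its span."""
    matches = list(HEADING.finditer(text))
    headlines = []
    for i, m in enumerate(matches):
        # Simplification: a headline applies until the next headline begins.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        headlines.append({"headline": m.group(2), "start": m.start(), "end": end})
    return SimpleDocument(content=text, meta={"headlines": headlines})

def split_document(doc: SimpleDocument, split_length: int) -> list:
    """Split by character count and adapt headline spans to each chunk."""
    chunks = []
    for offset in range(0, len(doc.content), split_length):
        chunk_end = min(offset + split_length, len(doc.content))
        headlines = [
            {
                "headline": h["headline"],
                # Clip the span to the chunk, then make it chunk-relative.
                "start": max(h["start"], offset) - offset,
                "end": min(h["end"], chunk_end) - offset,
            }
            for h in doc.meta["headlines"]
            if h["start"] < chunk_end and h["end"] > offset
        ]
        chunks.append(
            SimpleDocument(doc.content[offset:chunk_end], {"headlines": headlines})
        )
    return chunks

sample = "# Intro\nHello world.\n## Details\nMore text here.\n"
doc = convert_markdown(sample)
chunks = split_document(doc, split_length=30)
```

With this layout, every chunk still knows which headlines cover it even when the headline text itself ended up in an earlier chunk. A real implementation would also have to decide how to handle nested heading levels and word- or sentence-based splitting.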

[ ] #3056
[ ] #3057
~[ ] #3058~

Jul 14 '22 10:07 masci

Let's remove #3058 from the scope of this epic; we'll tackle that separately.

Sep 05 '22 11:09 masci