datashare
Index or re-index the content of a single file
Is your feature request related to a problem? Please describe.
#1648 implemented page index extraction and page content extraction. There are two main issues with it:
- it can be very slow, because it relies on Tika content extraction from the original source file
- it can get out of sync with the indexed content, because the Tika version may have changed between content extraction and page index extraction.
Describe the solution you'd like
For the first point, see #1814. For the second point, we could add a backend endpoint to reindex a file from its id:
PUT /api/task/reindex/<doc_id>
It could be idempotent, doing nothing if the Tika version in the metadata is already the latest one. Otherwise, it would launch a background task that indexes the file. Alternatively, the call could be synchronous.
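
To make the idempotency idea concrete, here is a minimal sketch of the check in Python. It is illustrative only and not the datashare implementation: the `tika_version` metadata key, the function names, and the `index_file` callback are all hypothetical, and the real backend would run this logic behind the proposed endpoint.

```python
from typing import Callable, Dict


def needs_reindex(metadata: Dict[str, str], latest_tika_version: str) -> bool:
    """Return True when the stored Tika version is missing or not the latest.

    "tika_version" is a hypothetical metadata key used for illustration.
    """
    return metadata.get("tika_version") != latest_tika_version


def reindex_document(
    doc_id: str,
    metadata: Dict[str, str],
    latest_tika_version: str,
    index_file: Callable[[str], None],
) -> str:
    """Reindex a document only if its content was extracted with another Tika version."""
    if not needs_reindex(metadata, latest_tika_version):
        # Already extracted with the latest Tika: repeating the call is a no-op.
        return "skipped"
    # Hypothetical hook: could enqueue a background task or index synchronously.
    index_file(doc_id)
    # Record the version used so that later calls are skipped.
    metadata["tika_version"] = latest_tika_version
    return "reindexed"
```

With this shape, calling the endpoint twice for the same document triggers at most one indexing run, and the choice between a background task and a synchronous call is confined to whatever `index_file` does.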