
Text search

Open kirillt opened this issue 3 years ago • 2 comments

It should be possible to filter documents by the presence of a query word in their text content. This requires implementing:

  1. Text-layer extraction from PDF and other text-based documents. This must happen during folder indexing.
  2. A new button or menu option on the Resources Grid screen that displays a text input box. The input string should be searched for in the text layers of all resources, and the matching resources must be displayed.
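
The two steps above could be sketched roughly as follows. This is only an illustration in Python, not the app's actual code: `TextIndex`, `EXTRACTORS`, and `read_plain_text` are hypothetical names, plain-text files stand in for real text-layer extraction, and a PDF extractor would have to be plugged in separately.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


def read_plain_text(path: Path) -> str:
    """Stand-in extractor: treat the whole file as its text layer."""
    return path.read_text(encoding="utf-8", errors="ignore")


# Hypothetical registry: file suffix -> function returning the text layer.
# A PDF text-layer extractor would be registered here under ".pdf".
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain_text,
    ".md": read_plain_text,
}


@dataclass
class TextIndex:
    # Maps each indexed resource to its case-folded text layer.
    texts: Dict[Path, str] = field(default_factory=dict)

    def index_folder(self, root: Path) -> None:
        """Step 1: extract text layers while walking the folder."""
        for path in sorted(root.rglob("*")):
            extractor = EXTRACTORS.get(path.suffix.lower())
            if path.is_file() and extractor is not None:
                self.texts[path] = extractor(path).casefold()

    def search(self, query: str) -> List[Path]:
        """Step 2: return resources whose text layer contains the query."""
        needle = query.strip().casefold()
        if not needle:
            return []
        return [path for path, text in self.texts.items() if needle in text]
```

Storing the extracted text at indexing time keeps the search itself a cheap in-memory substring check; resources without a registered extractor are simply skipped.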

kirillt avatar Jan 04 '22 10:01 kirillt

I suggest going further: we could use Tesseract for OCR of images and PDF documents in English (and possibly other languages). That way, text metadata would be attached to every PDF and image file, not only to plain-text files.

The following points should be taken into account:

  1. We should study what can be done for rich-formatted documents (Microsoft Office, LibreOffice).
  2. We should study what can be done with binary files.
  3. If required, we could use Tesseract's TryGetBoundingBox function to highlight results in PDF and image files in a detailed search results view.
  4. For rich-formatted documents, we would need a different solution from the one described in the previous point.
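
To make point 3 more concrete, here is a rough Python sketch of how matches could be turned into highlight rectangles, assuming the OCR step yields one bounding box per recognized word. `WordBox` and `boxes_to_highlight` are hypothetical names used for illustration and are not part of Tesseract's API.

```python
from typing import List, NamedTuple


class WordBox(NamedTuple):
    """One recognized word with its bounding box in image pixels (assumed shape)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def boxes_to_highlight(words: List[WordBox], query: str) -> List[WordBox]:
    """Return the boxes of all words that contain the query, ignoring case."""
    needle = query.strip().casefold()
    if not needle:
        return []
    return [word for word in words if needle in word.text.casefold()]


# Usage: the detailed results view would draw a rectangle for each returned box.
page = [
    WordBox("Annual", 10, 10, 60, 12),
    WordBox("Report", 75, 10, 58, 12),
    WordBox("2021", 140, 10, 40, 12),
]
matches = boxes_to_highlight(page, "report")
```

This only matches single words; a multi-word query would need to be matched against consecutive boxes on the same line.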

sisco0 avatar Jan 10 '22 15:01 sisco0

Good thoughts. I've just created a separate issue for the text layer, since it can also be used for tag suggestions: #183

kirillt avatar Jan 11 '22 16:01 kirillt