
feat: caching PDF page indices

Open bamthomas opened this issue 7 months ago • 1 comments

Is your feature request related to a problem? Please describe.

#1648 implemented the page indices extraction and content page extraction. There are 2 main issues with it:

  1. it can be very slow because it relies on Tika content extraction of the original source file
  2. it can be de-synchronized from the indexed content because the Tika version may have changed between content extraction and page indices extraction.

Describe the solution you'd like

For the first point we could cache the page indices in JSON files on the filesystem, like we did with artifact extraction. It could be useful to store the Tika version in the cache files so it can be compared with the indexed content metadata.

For now the content metadata does not contain the Tika version, so we'd need to add it.

For existing indices, a script could add it by following the chain extraction date -> datashare version -> extract version -> Tika version.

When the versions are not the same, we could add a sync feature that would re-index the file content. That could eventually be specified in another issue.
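
A minimal sketch of the version comparison that could trigger such a sync. The function names are hypothetical, and stripping a `Tika-` prefix is an assumption based on the example file format in this issue (`"Tika-3.0.1"`); the indexed metadata might well store the version differently.

```python
from typing import Optional


def normalize_version(raw: str) -> str:
    """Strip an optional 'Tika-' prefix so 'Tika-3.0.1' and '3.0.1' compare equal.

    ASSUMPTION: the cache stores 'Tika-<version>' while the index may store
    the bare version number.
    """
    prefix = "tika-"
    value = raw.strip()
    if value.lower().startswith(prefix):
        value = value[len(prefix):]
    return value


def needs_resync(cache_version: str, indexed_version: Optional[str]) -> bool:
    """Return True when cached page indices may not match the indexed content."""
    if indexed_version is None:
        # The index does not record a Tika version: be conservative.
        return True
    return normalize_version(cache_version) != normalize_version(indexed_version)
```

Treating a missing indexed version as out of sync is only one possible policy; a date-based fallback for the missing version is another.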

Describe alternatives you've considered

We also thought of extracting pages directly to avoid de-synchronization but it would need much more disk space.

The file format would be:

{
   "extractor_version": "Tika-3.0.1",
   "pages": [
      [0, 123],
      [124, 432],
      [433, 654]
   ]
}
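
As a rough illustration of how such a cache file could be written, read back, and used to slice a page out of already-indexed content, here is a sketch. The function names and the cache path are hypothetical. It also assumes that each pair is `[start, end]` with an inclusive end offset (the next page in the example starts at `end + 1`) and that offsets index directly into the content string; neither is confirmed by this issue.

```python
import json
from pathlib import Path
from typing import List, Tuple

Pages = List[Tuple[int, int]]


def write_page_cache(cache_file: Path, extractor_version: str, pages: Pages) -> None:
    """Persist page indices next to the extractor version that produced them."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "extractor_version": extractor_version,
        "pages": [[start, end] for start, end in pages],
    }
    cache_file.write_text(json.dumps(payload), encoding="utf-8")


def read_page_cache(cache_file: Path) -> Tuple[str, Pages]:
    """Load the extractor version and the page indices from a cache file."""
    data = json.loads(cache_file.read_text(encoding="utf-8"))
    pages = [(start, end) for start, end in data["pages"]]
    return data["extractor_version"], pages


def page_text(content: str, pages: Pages, page_number: int) -> str:
    """Return the text of one page (0-based) from the full document content."""
    start, end = pages[page_number]
    # ASSUMPTION: the end offset is inclusive, hence the + 1.
    return content[start:end + 1]
```

With this layout, serving a page only needs the cached indices and the indexed content, so Tika does not have to re-read the source file.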

bamthomas avatar May 28 '25 13:05 bamthomas

Maybe we could hardcode this map as a polyfill: when the Tika indexing version is missing from the document metadata, we would look it up from the document indexing date.

The date/version pairs come from every pom.xml update of the Tika version. We usually release extract and update datashare right away, so the mapping should be quite accurate.

date tika_version
1970-01-01T00:00:00 1.8.0
2015-07-08T03:59:48 1.9.0
2015-08-10T23:05:19 1.10.0
2015-10-28T13:36:23 1.11.0
2016-02-27T01:46:47 1.12.0
2016-05-24T15:29:32 1.13.0
2016-11-04T18:58:15 1.14rc1
2016-11-14T12:23:43 1.14.0
2017-06-11T18:43:49 1.15.0
2017-08-17T16:09:55 1.16.0
2018-02-13T09:49:53 1.17.0
2018-06-11T13:04:21 1.18.0
2019-06-07T10:35:53 1.20.0
2019-08-12T15:16:26 1.22.0
2020-09-14T08:27:25 1.24.1
2020-09-16T16:20:26 1.22.0
2021-04-02T16:13:51 1.24.1
2021-04-02T16:52:36 1.22.0
2022-10-10T11:10:34 1.23.0
2022-10-20T11:57:11 2.4.1
2025-03-12T09:58:47 2.9.3
2025-03-12T10:52:07 3.1.0
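
The lookup against the table above could be sketched as follows. The names `TIKA_VERSION_BY_DATE` and `tika_version_at` are hypothetical, and the sketch assumes indexing dates are naive datetimes in the same timezone as the table (comparing a timezone-aware datetime against these naive ones would raise a `TypeError`).

```python
from bisect import bisect_right
from datetime import datetime

# Rows copied from the table above, sorted by date.
TIKA_VERSION_BY_DATE = [
    ("1970-01-01T00:00:00", "1.8.0"),
    ("2015-07-08T03:59:48", "1.9.0"),
    ("2015-08-10T23:05:19", "1.10.0"),
    ("2015-10-28T13:36:23", "1.11.0"),
    ("2016-02-27T01:46:47", "1.12.0"),
    ("2016-05-24T15:29:32", "1.13.0"),
    ("2016-11-04T18:58:15", "1.14rc1"),
    ("2016-11-14T12:23:43", "1.14.0"),
    ("2017-06-11T18:43:49", "1.15.0"),
    ("2017-08-17T16:09:55", "1.16.0"),
    ("2018-02-13T09:49:53", "1.17.0"),
    ("2018-06-11T13:04:21", "1.18.0"),
    ("2019-06-07T10:35:53", "1.20.0"),
    ("2019-08-12T15:16:26", "1.22.0"),
    ("2020-09-14T08:27:25", "1.24.1"),
    ("2020-09-16T16:20:26", "1.22.0"),
    ("2021-04-02T16:13:51", "1.24.1"),
    ("2021-04-02T16:52:36", "1.22.0"),
    ("2022-10-10T11:10:34", "1.23.0"),
    ("2022-10-20T11:57:11", "2.4.1"),
    ("2025-03-12T09:58:47", "2.9.3"),
    ("2025-03-12T10:52:07", "3.1.0"),
]

_DATES = [datetime.fromisoformat(date) for date, _ in TIKA_VERSION_BY_DATE]


def tika_version_at(indexing_date: datetime) -> str:
    """Return the Tika version in use at the given indexing date.

    Picks the last table row whose date is not after indexing_date.
    """
    position = bisect_right(_DATES, indexing_date) - 1
    if position < 0:
        raise ValueError(f"no Tika version known before {indexing_date}")
    return TIKA_VERSION_BY_DATE[position][1]
```

Because versions went back and forth between 1.22.0 and 1.24.1, the lookup has to be keyed on dates rather than on an assumption that versions only increase.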

bamthomas avatar May 28 '25 15:05 bamthomas