
in-document publication-wide text search improvements

Open danielweck opened this issue 5 years ago • 1 comments

We already support multiline / line-break matching, with collapsing of whitespace and of DOM space / tab indentation. However, we still need to implement string normalization for ligatures / Unicode pairs, diacritics / accented characters, etc.
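As a rough illustration of the normalization meant here, the TypeScript sketch below folds a string before comparison. The name foldForSearch is hypothetical and not part of Thorium. It assumes that NFKD decomposition followed by removal of the Combining Diacritical Marks block (U+0300 to U+036F) is a good enough approximation; characters without a Unicode decomposition, such as œ or ß, would need an explicit table.

```typescript
// Hypothetical helper (not Thorium code): fold text so that accents, ligatures,
// letter case and whitespace runs no longer affect string comparison.
function foldForSearch(text: string): string {
  return text
    .normalize("NFKD")               // decompose: U+00E9 -> "e" + U+0301, U+FB01 -> "fi"
    .replace(/[\u0300-\u036f]/g, "") // drop combining marks (basic block only)
    .toLowerCase()                   // case-insensitive matching
    .replace(/\s+/g, " ");           // collapse line breaks, tabs and space runs
}

// Query and search space are folded the same way before comparison.
const haystack = foldForSearch("Caf\u00e9  \ufb01anc\u00e9\n\tna\u00efve");
const found = haystack.indexOf(foldForSearch("FIANCE")) >= 0;
console.log(haystack, found);
```

Folding this way loses the original offsets, which is why a match found in the folded string still has to be mapped back to the DOM text.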

See: https://github.com/edrlab/thorium-reader/pull/1086#issue-431414659 https://github.com/edrlab/thorium-reader/pull/1049/files#diff-600356d6a987e6ef5c15d0a38e75c5888e8483d82dd7e9e5fe7c875df4813e50

danielweck, Oct 13 '20 01:10

Related issue: https://github.com/edrlab/thorium-reader/issues/1213

danielweck, Oct 14 '20 08:10

A good candidate for the HTML search algorithm is https://github.com/GoogleChromeLabs/text-fragments-polyfill/, which includes diacritics removal / text normalisation with support for back-referencing the original DOM offsets. Its highlighting is performed with DOM mutations (insertion of mark elements), but we only need the raw output data, mapped to our own DOM range serialisation, so that we can use our own highlighter.

  • Normalise the search space as well as the query so that a simple indexOf() can be used, then de-normalise the matched range to index into the real DOM: https://github.com/GoogleChromeLabs/text-fragments-polyfill/blob/53375fea08665bac009bb0aa01a030e065c3933d/src/text-fragment-utils.js#L924-L934 ==> https://github.com/GoogleChromeLabs/text-fragments-polyfill/blob/53375fea08665bac009bb0aa01a030e065c3933d/src/text-fragment-utils.js#L718-L719 ==> https://github.com/GoogleChromeLabs/text-fragments-polyfill/blob/53375fea08665bac009bb0aa01a030e065c3933d/src/text-fragment-utils.js#L769-L830
  • Use a regexp to avoid normalising the search space and the query (the regexp encodes the normalisation rules, but its matches are expressed relative to the real DOM text): https://github.com/julkue/mark.js/blob/7f7e9820514e2268918c2259b58aec3bd5f437f6/src/lib/regexpcreator.js#L305-L319 ==> https://github.com/julkue/mark.js/blob/7f7e9820514e2268918c2259b58aec3bd5f437f6/src/lib/mark.js#L837
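
The two approaches above can be sketched in TypeScript roughly as follows. This is a minimal illustration over a plain string, not the code of either linked library: foldWithMap, findAll, queryToRegExp and the tiny EQUIV table are hypothetical names, and the folding rules (NFKD, removal of U+0300 to U+036F, lower-casing, whitespace collapsing) are assumptions. A real implementation would first concatenate DOM text nodes and then translate string offsets back into DOM ranges.

```typescript
// Approach 1: fold the text, remembering which original offset produced each folded char.
interface Folded { text: string; map: number[] } // map[i] = UTF-16 offset in the original
interface Match { start: number; end: number }   // offsets into the original, end exclusive

function foldWithMap(original: string): Folded {
  let text = "";
  const map: number[] = [];
  let i = 0;
  while (i < original.length) {
    const ch = String.fromCodePoint(original.codePointAt(i) as number);
    let folded = ch.normalize("NFKD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
    if (/^\s+$/.test(folded)) {
      folded = text.endsWith(" ") ? "" : " "; // collapse whitespace runs to one space
    }
    for (let k = 0; k < folded.length; k++) { map.push(i); }
    text += folded;
    i += ch.length;
  }
  return { text, map };
}

function findAll(original: string, query: string): Match[] {
  const hay = foldWithMap(original);
  const needle = foldWithMap(query).text.trim();
  const matches: Match[] = [];
  if (needle.length === 0) { return matches; }
  let at = hay.text.indexOf(needle);
  while (at >= 0) {
    const last = at + needle.length - 1;
    // De-normalise: the match ends where the next distinct source character begins,
    // so trailing combining marks and whole ligatures stay inside the range.
    let j = last + 1;
    while (j < hay.map.length && hay.map[j] === hay.map[last]) { j++; }
    const end = j < hay.map.length ? hay.map[j] : original.length;
    matches.push({ start: hay.map[at], end });
    at = hay.text.indexOf(needle, at + needle.length);
  }
  return matches;
}

// Approach 2: leave the text untouched and encode the folding rules in a regexp.
// Deliberately tiny, illustrative equivalence table; a real one would be far larger.
const EQUIV: string[] = ["aàáâãäå", "eèéêë", "iìíîï", "oòóôõö", "uùúûü", "cç", "nñ"];

function queryToRegExp(query: string): RegExp {
  let source = "";
  for (const ch of foldWithMap(query).text.trim().split("")) {
    if (ch === " ") { source += "\\s+"; continue; } // any whitespace run, including line breaks
    const group = EQUIV.find((g) => g.includes(ch));
    const escaped = ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // Each letter may be a precomposed variant or be followed by combining marks.
    source += (group ? `[${group}]` : escaped) + "[\\u0300-\\u036f]*";
  }
  return new RegExp(source, "gi");
}

const sample = "Un caf\u00e9  noir, un CAFE\u0301.";
console.log(findAll(sample, "cafe").map((m) => sample.slice(m.start, m.end)));
console.log(sample.match(queryToRegExp("cafe")));
```

The first approach keeps matching trivial at the cost of building an offset map for the whole search space. The second reports offsets directly in the original text, but the regexp grows with the equivalence table, and ligatures are not covered by this sketch.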

danielweck, Feb 02 '23 18:02

Related: https://github.com/edrlab/thorium-reader/issues/1920

danielweck, Mar 05 '23 20:03