
search does not return matches in notes imported to html if source contains html named characters

Open stephan-dev opened this issue 1 year ago • 2 comments

Probably not a problem in English, but in French, for example, many non-ASCII characters become named character references in HTML, such as é => &eacute; and ç => &ccedil;. Most often, this breaks Joplin search.

Screenshot: instead of showing an empty search results page (the problem), I'm showing a theoretical (successful) search, in which the user includes the HTML named characters in the query: l'URSS, c'est le Stalinisme (top left corner).

[Screenshot: joplin search html bug 2]

This shows that Joplin is searching the HTML source code instead of the parsed text (which is what Ctrl+F, "find in this page", does in a browser).
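To make the mismatch concrete, here is a minimal sketch in plain JavaScript. It does not use Joplin's code: `encodeNamedEntities`, `decodeNamedEntities` and the three-entry entity table are hypothetical stand-ins, and a simple substring check stands in for the real search. It only illustrates why a query typed with accents cannot match a body stored with named entities.

```javascript
// Hypothetical, deliberately tiny entity table: only the characters used below.
const NAMED_ENTITIES = { 'é': '&eacute;', 'ç': '&ccedil;', 'à': '&agrave;' };

// Mimics what an entity-encoding step does to text during import (illustration only).
function encodeNamedEntities(text) {
  return text.replace(/[éçà]/g, (ch) => NAMED_ENTITIES[ch]);
}

// Reverses the encoding, roughly what a browser does before Ctrl+F sees the text.
function decodeNamedEntities(html) {
  return html.replace(/&(eacute|ccedil|agrave);/g, (entity) =>
    Object.keys(NAMED_ENTITIES).find((ch) => NAMED_ENTITIES[ch] === entity));
}

// What ends up stored as the note body after an entity-encoding import.
const storedBody = encodeNamedEntities('<p>Le français est parlé à Paris</p>');
const query = 'français';

// Searching the HTML source misses; searching the decoded text hits.
const foundInSource = storedBody.includes(query);
const foundInParsedText = decodeNamedEntities(storedBody).includes(query);

console.log(storedBody);
console.log({ foundInSource, foundInParsedText });
```

In this sketch the query only matches the stored source if it is typed as `fran&ccedil;ais`, which is the situation the screenshot simulates.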

Steps to reproduce:

  • import ENEX notes as HTML
  • choose a target search string
    • that contains characters like é, ç, à, c'est...
    • and that exists inside a note imported as HTML
  • search for it in Joplin search (from the search box or Ctrl+P), with or without the accents

Expected

  • string is found in all the notes where it exists, including notes imported to .html

Bug :

  • the string is not found in notes imported to .html, if the html source code contains html named characters for this query.

I call this a bug (if confirmed) because I did this import a month ago, and I have only now realized that I've been working with partial search results ever since. There's no warning about this. I tried to read the code, but I have no idea how to fix it.

Joplin 2.8.8 on Linux.

stephan-dev, Aug 14 '22 05:08

I think this is what breaks the search of notes imported as HTML: https://github.com/laurent22/joplin/blob/641b0fa9a2/packages/lib/import-enex-html-gen.js#L77

	saxStream.on('text', function(text) {
		section.lines.push(htmlentities(text));
	});

The commit is named "fix various bugs related to the import of ENEX files as HTML": https://github.com/laurent22/joplin/commit/fcd00b32125744636fb1c5b9c4a3b71cf2520edc. Which bug in particular was this line fixing? Can we estimate its cost/benefit versus breaking SQLite FTS search in non-English languages?

I haven't tried to build joplin with that line edited or tried to import Enex => Html without it, so I'm not sure I'm on the right track.
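For discussion, here is an untested sketch of one possible alternative. It assumes the encoding is there so that decoded `<` and `&` in text nodes are not re-read as markup, which I have not confirmed from the commit. `escapeMarkupOnly` and `onText` are hypothetical names, not Joplin functions, and `onText` only stands in for the `text` handler quoted above. The idea is to escape just the markup-significant characters and leave accented letters as literal characters.

```javascript
// Hypothetical replacement for a full entity encoder: escapes only the
// characters that would otherwise be parsed as markup, leaving é, ç, à intact.
function escapeMarkupOnly(text) {
  return text
    .replace(/&/g, '&amp;') // must run first so later escapes are not double-encoded
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Stand-in for section.lines and the SAX 'text' handler (illustration only).
const lines = [];
function onText(text) {
  lines.push(escapeMarkupOnly(text));
}

onText('Le français & le <b> tag');
console.log(lines[0]); // Le français &amp; le &lt;b&gt; tag
```

With this approach the accented word stays searchable as typed, while the output remains valid HTML text. Whether it reintroduces whatever the original commit fixed is exactly the open question.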

For reference:

  • HTML entities (which I called "html named characters" above): https://wikiless.org/wiki/List_of_XML_and_HTML_character_entity_references
  • SAX parser documentation: https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html (see the sections "character events" and "handling special characters"; this doc is for Java, as I found nothing as clear for JS)

stephan-dev, Aug 19 '22 06:08

Hey there, it looks like there has been no activity on this issue recently. Has the issue been fixed, or does it still require the community's attention? If you require support or are requesting an enhancement or feature then please create a topic on the Joplin forum. This issue may be closed if no further activity occurs. You may comment on the issue and I will leave it open. Thank you for your contributions.

github-actions[bot], Sep 18 '22 16:09

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, feel free to create a new issue with up-to-date information.

github-actions[bot], Sep 25 '22 16:09