TiddlyWiki5
[BUG] Freelinks plugin: Does not recognise titles consisting of Chinese characters
The Freelinks plugin does not recognise links to tiddlers whose titles consist of Chinese characters.
The problem lies in the way that the Freelinks plugin constructs a massive JavaScript regular expression that matches the titles that are available for freelinking. For example, if there were only two tiddlers, called "Foo" and "Bar", the regular expression would look like this:
/(\bFoo\b)|(\bBar\b)/
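The construction might be sketched roughly like this (illustrative only: `escapeRegExp` and `buildFreelinkRegExp` are hypothetical names, not the plugin's actual code):

```javascript
// Escape regexp metacharacters so titles are matched literally
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one big alternation: each title gets its own capture group,
// wrapped in \b word-boundary assertions
function buildFreelinkRegExp(titles) {
  const source = titles.map(t => "(\\b" + escapeRegExp(t) + "\\b)").join("|");
  return new RegExp(source, "g");
}

const re = buildFreelinkRegExp(["Foo", "Bar"]);
console.log(re.source); // (\bFoo\b)|(\bBar\b)
```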
The issue lies with the way that each matchable title is wrapped with \b assertions, which are defined as:
Matches a word boundary. This is the position where a word character is not followed or preceded by another word-character, such as between a letter and a space. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero.
This is done so that a word fragment "Foo" in the text "Foobar" will not be linked. However, the problem is that word boundaries don't have meaning in Chinese, and so the regular expression never matches.
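The failure is easy to reproduce: in JavaScript, \b only recognises ASCII word characters ([A-Za-z0-9_]), so a title made of Han characters never sits on a "word boundary". A minimal check (the titles are just examples):

```javascript
// ASCII title: \b finds boundaries at the space/letter transitions
console.log(/\bFoo\b/.test("link to Foo here"));   // true

// Han title: neither the spaces nor the Han characters count as \w,
// so there is no word boundary anywhere and the match always fails
console.log(/\b维基\b/.test("link to 维基 here")); // false
```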
I think the best fix is to add a new global setting for whether word boundaries are respected when freelinking. The disadvantage of a global setting is that wikis containing mixed English and Chinese titles would find that the English titles would sometimes be wrongly matched.
A per-tiddler basis would be more flexible, but presumably would require us to be able to automatically distinguish titles that should be matched to word boundaries (eg detecting English vs Chinese), or would require users to explicitly mark each tiddler.
> A per-tiddler basis would be more flexible, but presumably would require us to be able to automatically distinguish titles that should be matched to word boundaries (eg detecting English vs Chinese)
This rang the "Unicode properties \p{…}" bell: each character, when considered as a Unicode character, bears several properties, including the writing system to which it belongs. See https://javascript.info/regexp-unicode#example-chinese-hieroglyphs
Thanks @xcazin. Unicode property escapes are new to me (they were introduced in ES2018). I see browser support is not too bad.
We'd need a regexp that matched strings that exclusively contain scripts that have no word boundaries, presumably constructed from a mapping of script names to a flag indicating whether they use word boundaries. I'm not sure where we'd find that data.
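As a sketch of that idea, assuming an illustrative set of boundary-free scripts (Han, Hiragana, Katakana and Thai here are placeholders, not the vetted mapping described above), the per-title decision could look like:

```javascript
// Illustrative list of scripts written without word boundaries;
// requires ES2018 \p{…} property escapes and the "u" flag
const NO_BOUNDARY_SCRIPTS =
  /^[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Thai}]+$/u;

// Wrap a title in \b only if it is not written entirely in a
// boundary-free script (titlePattern is a hypothetical helper name)
function titlePattern(title) {
  const escaped = title.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return NO_BOUNDARY_SCRIPTS.test(title) ? escaped : "\\b" + escaped + "\\b";
}

console.log(titlePattern("Foo"));  // \bFoo\b
console.log(titlePattern("维基")); // 维基
```

Mixed-script titles would fall through to the \b branch here, which matches the concern above about wikis with mixed English and Chinese titles.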
It seems like it doesn't work well for words ending with an accented letter either, see: http://www.telumire.be/TW/bugs/freelink
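That is consistent with \b being ASCII-only in JavaScript (and the "u" flag does not change \b): a boundary is found inside "café", between "f" and "é", but not after it, so the whole-word match fails:

```javascript
// "é" is not an ASCII \w character, so the trailing \b never matches
console.log(/\bcafé\b/.test("un café noir")); // false

// The unaccented equivalent behaves as expected
console.log(/\bcafe\b/.test("un cafe noir")); // true
```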
I ran into a similar issue when I tried to count words in a tiddler, and found that the most reliable way was to split the text by spaces to get each individual word rather than using the regex word boundary. This is the method used in Blender, for example.
Maybe this could be used here too?
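A minimal version of that word-count approach (a sketch; `countWords` is just an illustrative name): any run of non-whitespace characters counts as one word, regardless of script or accents.

```javascript
// Split on runs of whitespace; filter(Boolean) drops the empty
// strings produced by leading/trailing whitespace
function countWords(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

console.log(countWords("un café and 维基 too")); // 5
```

Note that for freelinking this sidesteps accent and script issues, but it would also treat "Foobar" as a single token, so punctuation-adjacent titles ("Foo," etc.) would still need handling.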
See also Splitting a String using Spaces with regex
A possible solution found on Stack Overflow: https://regex101.com/r/ifgH4H/1/
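The linked pattern is not reproduced here, but a lookaround-based "Unicode word boundary" in that spirit might look like the following (`boundaryPattern` is an illustrative name): instead of \b, require that the title is not adjacent to any Unicode letter, digit or underscore.

```javascript
// ES2018: \p{L}/\p{N} property escapes plus lookbehind.
// Han characters are also \p{L}, so purely CJK titles would still
// need the script-based check discussed above; this mainly fixes
// the accented-letter case.
function boundaryPattern(title) {
  const escaped = title.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(
    "(?<![\\p{L}\\p{N}_])" + escaped + "(?![\\p{L}\\p{N}_])",
    "u"
  );
}

console.log(boundaryPattern("café").test("un café noir")); // true
console.log(boundaryPattern("café").test("un cafés"));     // false
```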