URL detection result of the find method changed in v3?
When using the find method to detect URLs, I found that the detection results differ in v3 when there are no spaces before and after the URL.
v2
// URL
foo http://example.com bar
foo http://example.combar
foohttp://example.com bar
foohttp://example.combar
テストhttp://example.comテスト
v3
// URL
foo http://example.com bar
foo http://example.combar
foohttp://example.com bar
// Not URL
foohttp://example.combar
テストhttp://example.comテスト
Is this expected behavior? If so, I would like to see the following fix, even if it’s only for multi-byte characters, because we often write like this in Japanese.
// URL
foo http://example.com bar
foo http://example.combar
foohttp://example.com bar
テストhttp://example.comテスト
// Not URL
foohttp://example.combar
ref: #315
Hi @sunadoi, thanks for reporting.
The reasons for this regression in v3 are a bit complex and relate to the extended parsing I added to support Internationalized Domain Names (IDN). The parser now recognizes テスト as a word, whereas in v2 its characters were treated as unknown symbols. The parser is greedy (it tries to identify the longest possible tokens without backtracking), and since there is no delimiting whitespace, it treats テストhttp as a single word and the rest as an invalid URL.
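To make the greedy behavior concrete, here is a minimal Python sketch. It is not the plugin's actual scanner: `tokenize`, `ascii_only`, and `any_letter` are hypothetical names, and the two predicates only approximate how v2 and v3 decide what counts as a word character.

```python
def tokenize(text, is_word_char):
    """Greedy scan: extend each word token as far as possible, never backtrack."""
    tokens = []
    i = 0
    while i < len(text):
        if is_word_char(text[i]):
            j = i
            while j < len(text) and is_word_char(text[j]):
                j += 1
            tokens.append(("WORD", text[i:j]))
            i = j
        else:
            # Anything that is not a word character becomes a one-character symbol.
            tokens.append(("SYM", text[i]))
            i += 1
    return tokens


def ascii_only(ch):
    """Assumed v2-like rule: only ASCII letters and digits form words."""
    return ch.isascii() and ch.isalnum()


def any_letter(ch):
    """Assumed v3-like rule: letters and digits from any script form words."""
    return ch.isalnum()


text = "テストhttp://example.com"
v2_like = tokenize(text, ascii_only)
v3_like = tokenize(text, any_letter)

print(v2_like[:4])  # three SYM tokens, then ("WORD", "http")
print(v3_like[0])   # ("WORD", "テストhttp")
```

With the v3-like rule, `http` never appears as a token of its own, so nothing downstream can recognize it as a scheme.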
I believe I can fix this by making a distinction in the parser between ASCII words and non-ASCII words. Unfortunately, because of ambiguity in these types of examples, the best I can get with this plugin will be the following (I used {{}} to mark which portions of text will be identified as links):
foo {{http://example.com}} bar
foo {{http://example.combar}}
foohttp://{{example.com}} bar
テスト{{http://example.comテスト}}
I hope that works for you, because unfortunately I cannot think of a good strategy to cover all edge cases like this.
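As a rough illustration of the ASCII versus non-ASCII distinction, the Python sketch below reproduces the four results listed above. It is only a sketch under stated assumptions, not the plugin's implementation: every name in it is hypothetical, only `http` and `https` are modelled as schemes, `KNOWN_TLDS` stands in for a real TLD list, and a host after a scheme is accepted without checking its TLD.

```python
SCHEMES = {"http", "https"}  # assumption: only these schemes are modelled
KNOWN_TLDS = {"com"}         # assumption: stand-in for a real TLD list


def kind_of(ch):
    """Classify one character; ASCII and non-ASCII letters get distinct kinds."""
    if ch.isspace():
        return "ws"
    if ch.isalnum():
        return "ascii" if ch.isascii() else "nonascii"
    return "sym"


def tokenize(text):
    """Greedy scan that ends a word token wherever the character kind changes."""
    tokens = []  # each token is (kind, start, end)
    i = 0
    while i < len(text):
        kind = kind_of(text[i])
        j = i + 1
        # Symbols stay one character long; words and whitespace extend greedily.
        while kind != "sym" and j < len(text) and kind_of(text[j]) == kind:
            j += 1
        tokens.append((kind, i, j))
        i = j
    return tokens


def host_end(text, tokens, i):
    """Return the index one past the last token of a host-like run starting at i."""
    j = i
    while j < len(tokens):
        kind, start, end = tokens[j]
        if kind in ("ascii", "nonascii") or text[start:end] in (".", "-"):
            j += 1
        else:
            break
    return j


def find_links(text):
    tokens = tokenize(text)
    links = []
    i = 0
    while i < len(tokens):
        kind, start, end = tokens[i]
        word = text[start:end]
        if kind == "ascii" and word.lower() in SCHEMES and text[end:end + 3] == "://":
            j = host_end(text, tokens, i + 4)  # skip the scheme plus ':', '/', '/'
            if j > i + 4:
                links.append(text[start:tokens[j - 1][2]])
                i = j
                continue
        elif kind in ("ascii", "nonascii"):
            # No scheme: accept a dotted run only if it ends in a known TLD.
            j = host_end(text, tokens, i)
            run = text[start:tokens[j - 1][2]]
            if "." in run and run.rsplit(".", 1)[1] in KNOWN_TLDS:
                links.append(run)
                i = j
                continue
        i += 1
    return links


print(find_links("foohttp://example.com bar"))     # ['example.com']
print(find_links("テストhttp://example.comテスト"))  # ['http://example.comテスト']
```

Because テスト and `http` are now separate tokens, the scheme is found even without whitespace. The trailing テスト is still absorbed into the link, since a non-ASCII word is a legitimate part of an internationalized host and the sketch has no way to tell the two apart.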
@nfrasser
Thank you for your kind explanation. The fix you suggested works for me. I’ll be happy to see it😄
Fixed in the latest v4 release.