
Search issue - multi-byte words truncated

Open sbulen opened this issue 4 years ago • 3 comments

Description

If a post contains multi-byte characters, they are converted to HTML entities, which can take up to 8 or 9 bytes per character. The problem is that search words are truncated at 20 characters. Thus, any string of multi-byte characters can be truncated mid-entity, e.g.: [image]

Note the truncated html entities, e.g., &# and &#66. This is in the log_search_subjects table.
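
To illustrate the truncation described above, here is a minimal Python sketch (not SMF code). It assumes the entities are decimal numeric references and that the cut happens at exactly 20 characters, as stated in the description. The helper names `to_entities` and `truncate_entity_safe` are hypothetical and exist only for this illustration.

```python
MAX_WORD_LEN = 20  # truncation length stated in the issue


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal HTML entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def truncate_entity_safe(encoded: str, limit: int = MAX_WORD_LEN) -> str:
    """Hypothetical alternative: cut at `limit`, then drop a dangling entity."""
    cut = encoded[:limit]
    amp = cut.rfind("&#")
    # If the last entity has no terminating ';', it was split by the cut.
    if amp != -1 and ";" not in cut[amp:]:
        cut = cut[:amp]
    return cut


# Three characters from the Gothic block (U+1033C, U+10330, U+10332),
# each of which needs 4 bytes in UTF-8.
word = "\U0001033C\U00010330\U00010332"

encoded = to_entities(word)           # three 8-character entities, 24 in total
truncated = encoded[:MAX_WORD_LEN]    # naive cut, ends in a partial entity
safe = truncate_entity_safe(encoded)  # keeps only whole entities

print(encoded)
print(truncated)
print(safe)
```

The naive cut leaves a partial `&#66` at the end, matching the kind of fragment seen in log_search_subjects. The entity-aware variant is only one possible way to avoid the split, not a proposed patch.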

This causes further issues down the road. For example, when running an HTML-entity-to-UTF-8 conversion, you can get: [image] ...as that word really isn't unique once it has been truncated.

This issue exists in 2.0 as well.

In 2.1, this issue is restricted to 4-byte characters, as anything shorter than 4 bytes is no longer converted to HTML entities, though existing entities may be brought forward during an upgrade.
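
As a rough way to tell which characters fall into the 4-byte category, the following Python sketch measures the UTF-8 length of a few sample characters. The sample set and the names `utf8_len`, `samples`, and `affected` are assumptions made for illustration; the 4-byte threshold is taken from the statement above, not from SMF's source.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters chosen for illustration only.
samples = {
    "e-acute": "\u00e9",     # 2 bytes
    "euro": "\u20ac",        # 3 bytes
    "gothic": "\U0001033C",  # 4 bytes, from the Gothic block
    "emoji": "\U0001F600",   # 4 bytes
}

lengths = {name: utf8_len(ch) for name, ch in samples.items()}

# Per the description, only 4-byte characters are still turned into entities in 2.1.
affected = [name for name, n in lengths.items() if n == 4]

print(lengths)
print(affected)
```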

Steps to reproduce

  1. Post this: 𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.
  2. Look at log_search_subjects for that post

Environment (complete as necessary)

  • Version/Git revision: current
  • Database Type: mysql
  • Database Version: 5.7
  • PHP Version: 7.4

Additional information/references

4-byte characters are not common outside the use of emojis, certain symbols, and ancient texts....
But the SMF crowd is exactly the kinda crowd to use emojis, certain symbols, and ancient texts...

sbulen • Dec 13 '20 19:12

Based on the description, the issue exists only in MySQL.

In MySQL, a workaround could be to convert the database to utf8mb4 and change the parameter in the Subs-Db-mysql.php file.

albertlast • Dec 14 '20 05:12

Correct, MySQL only. utf8mb4 will require some DB changes and raising the minimum MySQL version.

But mb4 is definitely the right direction.

sbulen • Dec 14 '20 06:12

What I mean is that SMF is already prepared for this: some time ago I added this parameter, which can be changed by hand: https://github.com/SimpleMachines/SMF2.1/blob/bb7f39482ba05b0d7d209f302e682d01e6a9fbd9/Sources/Subs-Db-mysql.php#L60

albertlast • Dec 14 '20 06:12