The title and description strings of link cards to be added to a post are cut off in the middle of utf-8 multibyte characters, making them unreadable.

Open henoya opened this issue 1 year ago • 4 comments

Describe the bug

When the URL of a page is entered in a post and the "Add linkcard" button is pressed, the title and description of the linked page may be shown with unreadable characters at the end.

To Reproduce

Examples of links that do not display well:

(1) https://www3.nhk.or.jp/kansai-news/20231115/2000079610.html 正規サイトから情報不正入手「ウェブスキミン

Title and description of page (1):

title: 正規サイトから情報不正入手「ウェブスキミング」初検挙 京都|NHK 関西のニュース

description: 【NHK】音楽グループの公式サイトに不正なプログラムを仕掛け、このサイトで買い物した人のクレジットカードの情報を不正に入手したとして京都府警は、26…

Result of retrieving page (1) using "cardyb" api

https://cardyb.bsky.app/v1/extract?url=https://www3.nhk.or.jp/kansai-news/20231115/2000079610.html

{
  "error":"",
  "likely_type":"html",
  "url":"https://www3.nhk.or.jp/kansai-news/20231115/2000079610.html",
  "title":"正規サイトから情報不正入手「ウェブスキミング」初検挙 京都|NHK 関西\ufffd\ufffd...",
  "description":"【NHK】音楽グループの公式サイトに不正なプログラムを仕掛け、このサイトで買い物した人のクレジットカードの情報を不正に入手したとして京都\ufffd\ufffd...",
  "image":"https://cardyb.bsky.app/v1/image?url=https%3A%2F%2Fwww3.nhk.or.jp%2Fnews%2Fimg%2Ffb_futa_600px.png"
}

The response above was returned. Both the title and the description end in \ufffd\ufffd followed by "...".
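One way to spot affected responses is to look for U+FFFD in the decoded string fields. The following is a minimal sketch in Python (chosen here only for illustration), using the response shown above as sample input. The helper name `damaged_fields` is hypothetical and is not part of cardyb or social-app.

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def damaged_fields(response_text: str) -> list[str]:
    """Return the names of top-level string fields that contain U+FFFD.

    Hypothetical helper for illustration only.
    """
    data = json.loads(response_text)
    return [
        key
        for key, value in data.items()
        if isinstance(value, str) and REPLACEMENT in value
    ]


# The response quoted in this issue. The raw string keeps the \ufffd escapes
# as plain text so that json.loads decodes them, as an API client would.
sample = r'''{
  "error":"",
  "likely_type":"html",
  "url":"https://www3.nhk.or.jp/kansai-news/20231115/2000079610.html",
  "title":"正規サイトから情報不正入手「ウェブスキミング」初検挙 京都|NHK 関西\ufffd\ufffd...",
  "description":"【NHK】音楽グループの公式サイトに不正なプログラムを仕掛け、このサイトで買い物した人のクレジットカードの情報を不正に入手したとして京都\ufffd\ufffd...",
  "image":"https://cardyb.bsky.app/v1/image?url=https%3A%2F%2Fwww3.nhk.or.jp%2Fnews%2Fimg%2Ffb_futa_600px.png"
}'''

print(damaged_fields(sample))  # ['title', 'description']
```

A check like this only detects the symptom. It cannot recover the lost text, because the dangling bytes have already been replaced.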

For the title, the UTF-8 byte sequence of the original string is:

> xd test_title.txt
0000 e6 ad a3 e8 a6 8f e3 82  b5 e3 82 a4 e3 83 88 e3   正 規 サ イ ト か 
0010 81 8b e3 82 89 e6 83 85  e5 a0 b1 e4 b8 8d e6 ad     ら 情 報 不 正 
0020 a3 e5 85 a5 e6 89 8b e3  80 8c e3 82 a6 e3 82 a7    入 手 「 ウ ェ 
0030 e3 83 96 e3 82 b9 e3 82  ad e3 83 9f e3 83 b3 e3   ブ ス キ ミ ン グ 
0040 82 b0 e3 80 8d e5 88 9d  e6 a4 9c e6 8c 99 20 e4     」 初 検 挙  京 
0050 ba ac e9 83 bd ef bd 9c  4e 48 4b 20 e9 96 a2 e8     都 | NHK 関 西 
0060 a5 bf e3 81 ae e3 83 8b  e3 83 a5 e3 83 bc e3 82     の ニ ュ ー ス 
0070 b9                                                                 

while the byte sequence of the response retrieved from the "cardyb" API, which contains the truncated title, is:

❯ xd test_bad.txt
0000 7b 22 65 72 72 6f 72 22  3a 22 22 2c 22 6c 69 6b   {"error":"","lik
0010 65 6c 79 5f 74 79 70 65  22 3a 22 68 74 6d 6c 22   ely_type":"html"
0020 2c 22 75 72 6c 22 3a 22  68 74 74 70 73 3a 2f 2f   ,"url":"https://
0030 77 77 77 33 2e 6e 68 6b  2e 6f 72 2e 6a 70 2f 6b   www3.nhk.or.jp/k
0040 61 6e 73 61 69 2d 6e 65  77 73 2f 32 30 32 33 31   ansai-news/20231
0050 31 31 35 2f 32 30 30 30  30 37 39 36 31 30 2e 68   115/2000079610.h
0060 74 6d 6c 22 2c 22 74 69  74 6c 65 22 3a 22 e6 ad   tml","title":"正 
0070 a3 e8 a6 8f e3 82 b5 e3  82 a4 e3 83 88 e3 81 8b    規 サ イ ト か 
0080 e3 82 89 e6 83 85 e5 a0  b1 e4 b8 8d e6 ad a3 e5   ら 情 報 不 正 入 
0090 85 a5 e6 89 8b e3 80 8c  e3 82 a6 e3 82 a7 e3 83     手 「 ウ ェ ブ 
00a0 96 e3 82 b9 e3 82 ad e3  83 9f e3 83 b3 e3 82 b0    ス キ ミ ン グ 
00b0 e3 80 8d e5 88 9d e6 a4  9c e6 8c 99 20 e4 ba ac   」 初 検 挙  京 
00c0 e9 83 bd ef bd 9c 4e 48  4b 20 e9 96 a2 e8 a5 bf   都 | NHK 関 西 
00d0 5c 75 66 66 66 64 5c 75  66 66 66 64 2e 2e 2e 22   \ufffd\ufffd..."
00e0 2c 22 64 65 73 63 72 69  70 74 69 6f 6e 22 3a 22   ,"description":"
00f0 e3 80 90 4e 48 4b e3 80  91 e9 9f b3 e6 a5 bd e3   【 NHK】 音 楽 グ 
0100 82 b0 e3 83 ab e3 83 bc  e3 83 97 e3 81 ae e5 85     ル ー プ の 公 
0110 ac e5 bc 8f e3 82 b5 e3  82 a4 e3 83 88 e3 81 ab    式 サ イ ト に 
0120 e4 b8 8d e6 ad a3 e3 81  aa e3 83 97 e3 83 ad e3   不 正 な プ ロ グ 
0130 82 b0 e3 83 a9 e3 83 a0  e3 82 92 e4 bb 95 e6 8e     ラ ム を 仕 掛 
0140 9b e3 81 91 e3 80 81 e3  81 93 e3 81 ae e3 82 b5    け 、 こ の サ 
0150 e3 82 a4 e3 83 88 e3 81  a7 e8 b2 b7 e3 81 84 e7   イ ト で 買 い 物 
0160 89 a9 e3 81 97 e3 81 9f  e4 ba ba e3 81 ae e3 82     し た 人 の ク 
0170 af e3 83 ac e3 82 b8 e3  83 83 e3 83 88 e3 82 ab    レ ジ ッ ト カ 
0180 e3 83 bc e3 83 89 e3 81  ae e6 83 85 e5 a0 b1 e3   ー ド の 情 報 を 
0190 82 92 e4 b8 8d e6 ad a3  e3 81 ab e5 85 a5 e6 89     不 正 に 入 手 
01a0 8b e3 81 97 e3 81 9f e3  81 a8 e3 81 97 e3 81 a6    し た と し て 
01b0 e4 ba ac e9 83 bd 5c 75  66 66 66 64 5c 75 66 66   京 都 \ufffd\uff
01c0 66 64 2e 2e 2e 22 2c 22  69 6d 61 67 65 22 3a 22   fd...","image":"
01d0 68 74 74 70 73 3a 2f 2f  63 61 72 64 79 62 2e 62   https://cardyb.b
01e0 73 6b 79 2e 61 70 70 2f  76 31 2f 69 6d 61 67 65   sky.app/v1/image
01f0 3f 75 72 6c 3d 68 74 74  70 73 25 33 41 25 32 46   ?url=https%3A%2F
0200 25 32 46 77 77 77 33 2e  6e 68 6b 2e 6f 72 2e 6a   %2Fwww3.nhk.or.j
0210 70 25 32 46 6b 61 6e 73  61 69 2d 6e 65 77 73 25   p%2Fkansai-news%
0220 32 46 32 30 32 33 31 31  31 35 25 32 46 32 30 30   2F20231115%2F200
0230 30 30 37 39 36 31 30 5f  32 30 32 33 31 31 31 35   0079610_20231115
0240 31 38 32 33 35 37 5f 6d  2e 6a 70 67 22 7d 0a      182357_m.jpg"}. 

This is the result. Extracting only the problematic part:

Original:

0050 ba ac e9 83 bd ef bd 9c  4e 48 4b 20 e9 96 a2 e8     都 | NHK 関 西 
0060 a5 bf e3 81 ae e3 83 8b  e3 83 a5 e3 83 bc e3 82     の ニ ュ ー ス 

After trimming:

00c0 e9 83 bd ef bd 9c 4e 48  4b 20 e9 96 a2 e8 a5 bf   都 | NHK 関 西 
00d0 5c 75 66 66 66 64 5c 75  66 66 66 64 2e 2e 2e 22   \ufffd\ufffd..."
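
| Original bytes | UTF-8 char. | Trimmed bytes | UTF-8 char. |
|---|---|---|---|
| 4e | N | 4e | N |
| 48 | H | 48 | H |
| 4b | K | 4b | K |
| 20 | space | 20 | space |
| e9 96 a2 | 関 | e9 96 a2 | 関 |
| e8 a5 bf | 西 | e8 a5 bf | 西 |
| e3 81 ae | の | 5c 75 66 66 66 64 | \ufffd |
| e3 83 8b | ニ | 5c 75 66 66 66 64 | \ufffd |
| e3 83 a5 | ュ | 2e | . |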

The string is cut off in the middle of the 3-byte sequence that encodes a single UTF-8 character, which leaves a byte sequence that cannot be displayed as a character.

\ufffd is the escaped form of U+FFFD, the Unicode replacement character, which is substituted when data that is not a valid UTF-8 byte sequence is decoded as UTF-8.
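To make the origin of the replacement character concrete, here is a minimal Python sketch (Python is used only for illustration; this issue does not show how cardyb itself decodes text). It takes the tail bytes from the dump of the original title, cuts them after e3 81, and decodes the result strictly and leniently. The 12-byte cut point is chosen only to reproduce the e3 81 ending.

```python
# Bytes of "NHK 関西のニ", taken from the hex dump of the original title.
tail = bytes.fromhex("4e 48 4b 20 e9 96 a2 e8 a5 bf e3 81 ae e3 83 8b")

# Cutting at a fixed byte count (12 here, chosen for illustration) leaves
# the first two bytes of "の" (e3 81 ae) dangling at the end.
cut = tail[:12]


def decodes_strictly(data: bytes) -> bool:
    """Return True if data is a complete, valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lenient decoder substitutes U+FFFD for the bytes it cannot decode.
lenient = cut.decode("utf-8", errors="replace")

print(cut[-2:].hex(" "))      # e3 81
print(decodes_strictly(cut))  # False
print("\ufffd" in lenient)    # True
```

How many replacement characters a truncated sequence produces depends on the decoder, so the count seen here may differ from the two \ufffd in the cardyb response.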

In this example, the byte sequence e3 81 ae (の) was probably truncated to its first two bytes, e3 81, which were then converted to replacement characters and output.
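The dumps are consistent with a fixed byte limit. In the response, the title keeps 98 valid bytes before the first \ufffd. If each \ufffd stands for one leftover byte, the cut would fall at byte 100. Both the one-byte-per-\ufffd reading and the 100-byte limit are assumptions inferred from the dumps, not confirmed behavior of cardyb. A rough check in Python (used only for illustration):

```python
# UTF-8 bytes of the original title, copied from the first hex dump.
TITLE_HEX = """
e6 ad a3 e8 a6 8f e3 82 b5 e3 82 a4 e3 83 88 e3
81 8b e3 82 89 e6 83 85 e5 a0 b1 e4 b8 8d e6 ad
a3 e5 85 a5 e6 89 8b e3 80 8c e3 82 a6 e3 82 a7
e3 83 96 e3 82 b9 e3 82 ad e3 83 9f e3 83 b3 e3
82 b0 e3 80 8d e5 88 9d e6 a4 9c e6 8c 99 20 e4
ba ac e9 83 bd ef bd 9c 4e 48 4b 20 e9 96 a2 e8
a5 bf e3 81 ae e3 83 8b e3 83 a5 e3 83 bc e3 82
b9
"""
title = bytes.fromhex(TITLE_HEX)

ASSUMED_LIMIT = 100  # hypothetical: inferred from the dumps, not confirmed

# Take the first 100 bytes, as a byte-oriented truncation would.
head = title[:ASSUMED_LIMIT]

# errors="ignore" drops the dangling bytes so the complete part can be measured.
complete = head.decode("utf-8", errors="ignore")

print(len(title))                     # 113
print(head[-2:].hex(" "))             # e3 81
print(len(complete.encode("utf-8")))  # 98
```

The same count works for the description: 198 valid bytes precede the first \ufffd in the response dump, which would match a 200-byte limit under the same assumption.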

Expected behavior

Although social-app itself is currently working on multilingual support, this problem shows a lack of consideration for users of languages whose characters are encoded as multibyte sequences in UTF-8.
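As an illustration of the expected behavior, a truncation routine can step back to a character boundary before cutting. This is a minimal sketch in Python, not cardyb's actual code. The function name `truncate_utf8`, its parameters, and the 15-byte limit in the example are hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int, suffix: str = "...") -> str:
    """Shorten text to at most max_bytes UTF-8 bytes, suffix included,
    without splitting a multibyte character.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text

    suffix_len = len(suffix.encode("utf-8"))
    if max_bytes < suffix_len:
        raise ValueError("max_bytes is smaller than the suffix")

    end = max_bytes - suffix_len
    # Continuation bytes have the bit pattern 10xxxxxx. If the first excluded
    # byte is one, the cut falls inside a character, so step back to its start.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end].decode("utf-8") + suffix


# "NHK 関西のニュース", built from the bytes shown in the hex dump.
sample = bytes.fromhex(
    "4e 48 4b 20 e9 96 a2 e8 a5 bf e3 81 ae e3 83 8b e3 83 a5 e3 83 bc e3 82 b9"
).decode("utf-8")

print(truncate_utf8(sample, 15))  # NHK 関西...
```

This keeps whole code points only. Text that combines several code points into one visible character would need a grapheme-aware cut, which this sketch does not attempt.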

This garbled display of link cards existed in 2023-05 when I joined Bluesky, and has been neglected for a long time. I am a Japanese speaker, but I think this is a common problem for users of languages such as Chinese, Korean, Arabic, etc., where a single character is represented by multiple bytes.

We hope you will consider this.

henoya, Nov 15 '23 10:11

@Jacob2161 this is likely a cardyb issue

pfrazee, Dec 19 '23 21:12

Thanks for reporting this. Yes, truncation is probably not being handled correctly and this will be fixed.

Jacob2161, Dec 20 '23 03:12

@Jacob2161 Hello! What is the status on resolving this issue? Is there anything I can do to help?

Users of multi-byte character-dominated languages like CJK (Chinese/Japanese/Korean) see this problem many times every day.

noritada, Feb 07 '24 03:02

@noritada I hope I fixed this today. Can you give it a try and let me know if you still see issues?

Jacob2161, Feb 09 '24 20:02

@Jacob2161 Thank you very much!! I have checked with several web pages and confirmed that the issue has been resolved.

noritada, Feb 13 '24 08:02