social-app
The title and description strings of link cards added to a post are cut off in the middle of UTF-8 multibyte characters, making them unreadable.
Describe the bug
When the URL of a page is entered in a post and the "Add linkcard" button is pressed, the title and description strings of the linked page may appear with unreadable characters at the end.
To Reproduce
Examples of links that do not display well:
(1) https://www3.nhk.or.jp/kansai-news/20231115/2000079610.html
Title and description of page (1):
title: 正規サイトから情報不正入手「ウェブスキミング」初検挙 京都|NHK 関西のニュース
description: 【NHK】音楽グループの公式サイトに不正なプログラムを仕掛け、このサイトで買い物した人のクレジットカードの情報を不正に入手したとして京都府警は、26…
Result of retrieving page (1) using "cardyb" api
https://cardyb.bsky.app/v1/extract?url=https://www3.nhk.or.jp/kansai-news/20231115/2000079610.html
{
"error":"",
"likely_type":"html",
"url":"https://www3.nhk.or.jp/kansai-news/20231115/2000079610.html",
"title":"正規サイトから情報不正入手「ウェブスキミング」初検挙 京都|NHK 関西\ufffd\ufffd...",
"description":"【NHK】音楽グループの公式サイトに不正なプログラムを仕掛け、このサイトで買い物した人のクレジットカードの情報を不正に入手したとして京都\ufffd\ufffd...",
"image":"https://cardyb.bsky.app/v1/image?url=https%3A%2F%2Fwww3.nhk.or.jp%2Fnews%2Fimg%2Ffb_futa_600px.png"
}
The response above is what the API returned.
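A quick way to spot damaged fields in such a response is to look for U+FFFD after parsing the JSON. The following Python sketch is only an illustration: `damaged_fields` is a hypothetical helper, and the sample is the response above reduced to three fields.

```python
import json

# Sample reduced from the response above. The raw string keeps the \ufffd
# escapes as literal text so that the JSON parser is the one decoding them.
response_text = r'''{
"url":"https://www3.nhk.or.jp/kansai-news/20231115/2000079610.html",
"title":"正規サイトから情報不正入手「ウェブスキミング」初検挙 京都|NHK 関西\ufffd\ufffd...",
"description":"【NHK】音楽グループの公式サイトに不正なプログラムを仕掛け、このサイトで買い物した人のクレジットカードの情報を不正に入手したとして京都\ufffd\ufffd..."
}'''

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def damaged_fields(payload):
    """Return the keys whose string values contain U+FFFD (hypothetical helper)."""
    return [
        key
        for key, value in payload.items()
        if isinstance(value, str) and REPLACEMENT_CHAR in value
    ]


payload = json.loads(response_text)
print(damaged_fields(payload))
```

For this sample, only the title and description fields are flagged; the URL is left alone because it contains no replacement character.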
For the title, the UTF-8 byte sequence of the original string is:
> xd test_title.txt
0000 e6 ad a3 e8 a6 8f e3 82 b5 e3 82 a4 e3 83 88 e3 正 規 サ イ ト か
0010 81 8b e3 82 89 e6 83 85 e5 a0 b1 e4 b8 8d e6 ad ら 情 報 不 正
0020 a3 e5 85 a5 e6 89 8b e3 80 8c e3 82 a6 e3 82 a7 入 手 「 ウ ェ
0030 e3 83 96 e3 82 b9 e3 82 ad e3 83 9f e3 83 b3 e3 ブ ス キ ミ ン グ
0040 82 b0 e3 80 8d e5 88 9d e6 a4 9c e6 8c 99 20 e4 」 初 検 挙 京
0050 ba ac e9 83 bd ef bd 9c 4e 48 4b 20 e9 96 a2 e8 都 | NHK 関 西
0060 a5 bf e3 81 ae e3 83 8b e3 83 a5 e3 83 bc e3 82 の ニ ュ ー ス
0070 b9
while the byte sequence of the response retrieved from the "cardyb" API (which contains the truncated title) is:
❯ xd test_bad.txt
0000 7b 22 65 72 72 6f 72 22 3a 22 22 2c 22 6c 69 6b {"error":"","lik
0010 65 6c 79 5f 74 79 70 65 22 3a 22 68 74 6d 6c 22 ely_type":"html"
0020 2c 22 75 72 6c 22 3a 22 68 74 74 70 73 3a 2f 2f ,"url":"https://
0030 77 77 77 33 2e 6e 68 6b 2e 6f 72 2e 6a 70 2f 6b www3.nhk.or.jp/k
0040 61 6e 73 61 69 2d 6e 65 77 73 2f 32 30 32 33 31 ansai-news/20231
0050 31 31 35 2f 32 30 30 30 30 37 39 36 31 30 2e 68 115/2000079610.h
0060 74 6d 6c 22 2c 22 74 69 74 6c 65 22 3a 22 e6 ad tml","title":"正
0070 a3 e8 a6 8f e3 82 b5 e3 82 a4 e3 83 88 e3 81 8b 規 サ イ ト か
0080 e3 82 89 e6 83 85 e5 a0 b1 e4 b8 8d e6 ad a3 e5 ら 情 報 不 正 入
0090 85 a5 e6 89 8b e3 80 8c e3 82 a6 e3 82 a7 e3 83 手 「 ウ ェ ブ
00a0 96 e3 82 b9 e3 82 ad e3 83 9f e3 83 b3 e3 82 b0 ス キ ミ ン グ
00b0 e3 80 8d e5 88 9d e6 a4 9c e6 8c 99 20 e4 ba ac 」 初 検 挙 京
00c0 e9 83 bd ef bd 9c 4e 48 4b 20 e9 96 a2 e8 a5 bf 都 | NHK 関 西
00d0 5c 75 66 66 66 64 5c 75 66 66 66 64 2e 2e 2e 22 \ufffd\ufffd..."
00e0 2c 22 64 65 73 63 72 69 70 74 69 6f 6e 22 3a 22 ,"description":"
00f0 e3 80 90 4e 48 4b e3 80 91 e9 9f b3 e6 a5 bd e3 【 NHK】 音 楽 グ
0100 82 b0 e3 83 ab e3 83 bc e3 83 97 e3 81 ae e5 85 ル ー プ の 公
0110 ac e5 bc 8f e3 82 b5 e3 82 a4 e3 83 88 e3 81 ab 式 サ イ ト に
0120 e4 b8 8d e6 ad a3 e3 81 aa e3 83 97 e3 83 ad e3 不 正 な プ ロ グ
0130 82 b0 e3 83 a9 e3 83 a0 e3 82 92 e4 bb 95 e6 8e ラ ム を 仕 掛
0140 9b e3 81 91 e3 80 81 e3 81 93 e3 81 ae e3 82 b5 け 、 こ の サ
0150 e3 82 a4 e3 83 88 e3 81 a7 e8 b2 b7 e3 81 84 e7 イ ト で 買 い 物
0160 89 a9 e3 81 97 e3 81 9f e4 ba ba e3 81 ae e3 82 し た 人 の ク
0170 af e3 83 ac e3 82 b8 e3 83 83 e3 83 88 e3 82 ab レ ジ ッ ト カ
0180 e3 83 bc e3 83 89 e3 81 ae e6 83 85 e5 a0 b1 e3 ー ド の 情 報 を
0190 82 92 e4 b8 8d e6 ad a3 e3 81 ab e5 85 a5 e6 89 不 正 に 入 手
01a0 8b e3 81 97 e3 81 9f e3 81 a8 e3 81 97 e3 81 a6 し た と し て
01b0 e4 ba ac e9 83 bd 5c 75 66 66 66 64 5c 75 66 66 京 都 \ufffd\uff
01c0 66 64 2e 2e 2e 22 2c 22 69 6d 61 67 65 22 3a 22 fd...","image":"
01d0 68 74 74 70 73 3a 2f 2f 63 61 72 64 79 62 2e 62 https://cardyb.b
01e0 73 6b 79 2e 61 70 70 2f 76 31 2f 69 6d 61 67 65 sky.app/v1/image
01f0 3f 75 72 6c 3d 68 74 74 70 73 25 33 41 25 32 46 ?url=https%3A%2F
0200 25 32 46 77 77 77 33 2e 6e 68 6b 2e 6f 72 2e 6a %2Fwww3.nhk.or.j
0210 70 25 32 46 6b 61 6e 73 61 69 2d 6e 65 77 73 25 p%2Fkansai-news%
0220 32 46 32 30 32 33 31 31 31 35 25 32 46 32 30 30 2F20231115%2F200
0230 30 30 37 39 36 31 30 5f 32 30 32 33 31 31 31 35 0079610_20231115
0240 31 38 32 33 35 37 5f 6d 2e 6a 70 67 22 7d 0a 182357_m.jpg"}.
Extracting just the problematic part:
Original:
0050 ba ac e9 83 bd ef bd 9c 4e 48 4b 20 e9 96 a2 e8 都 | NHK 関 西
0060 a5 bf e3 81 ae e3 83 8b e3 83 a5 e3 83 bc e3 82 の ニ ュ ー ス
After trimming:
00c0 e9 83 bd ef bd 9c 4e 48 4b 20 e9 96 a2 e8 a5 bf 都 | NHK 関 西
00d0 5c 75 66 66 66 64 5c 75 66 66 66 64 2e 2e 2e 22 \ufffd\ufffd..."
Original Bytes | UTF-8 Char. | Trimmed Bytes | UTF-8 Char. |
---|---|---|---|
4e | N | 4e | N |
48 | H | 48 | H |
4b | K | 4b | K |
20 | space | 20 | space |
e9 96 a2 | 関 | e9 96 a2 | 関 |
e8 a5 bf | 西 | e8 a5 bf | 西 |
e3 81 ae | の | 5c 75 66 66 66 64 | � |
e3 83 8b | ニ | 5c 75 66 66 66 64 | � |
e3 83 a5 | ュ | 2e | . |
The string is cut in the middle of the 3-byte sequence of a single UTF-8 character, leaving an incomplete byte sequence that cannot be displayed as a character.
\ufffd is an Invalid Character (U+FFFD, the Unicode replacement character) that is displayed when the data contains bytes that do not form a valid UTF-8 byte sequence.
In this example, the byte sequence e3 81 ae (の) was probably cut after its first two bytes, leaving e3 81, which was converted to Invalid Characters and printed.
Expected behavior
Although social-app itself is currently working on multilingual support, this problem comes across as a lack of consideration for users of languages whose characters are encoded as multiple bytes in UTF-8.
This garbled display of link cards existed in 2023-05 when I joined Bluesky, and has been neglected for a long time. I am a Japanese speaker, but I think this is a common problem for users of languages such as Chinese, Korean, Arabic, etc., where a single character is represented by multiple bytes.
We hope you will consider this.
@Jacob2161 this is likely a cardyb issue
Thanks for reporting this. Yes, truncation is probably not being handled correctly and this will be fixed.
@Jacob2161 Hello! What is the status on resolving this issue? Is there anything I can do to help?
Users of multi-byte character-dominated languages like CJK (Chinese/Japanese/Korean) see this problem many times every day.
@noritada I hope I fixed this today. Can you give it a try and let me know if you still see issues?
@Jacob2161 Thank you very much!! I have checked with several web pages and confirmed that the issue has been resolved.