Telethon
The URL of `MessageEntityTextUrl` is not desurrogated if `parse_mode` is html
Checklist
- [x] The error is in the library's code, and not in my own.
- [x] I have searched for this issue before posting it and there isn't a duplicate.
- [ ] I ran `pip install -U https://github.com/LonamiWebs/Telethon/archive/master.zip` and triggered the bug in the latest version. (Since v2 is unfinished, it seems broken and unable to run. I managed to do some tests; please read the last two sections.)
Code that causes the issue
import asyncio
from telethon import TelegramClient

# API_ID, API_HASH, PROXY, TOKEN and USER are placeholders that must be defined before running.
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
bot = TelegramClient('bot', API_ID, API_HASH, proxy=PROXY).start(bot_token=TOKEN)


async def main():
    # Markdown: the parser desurrogates the URL, so this call succeeds.
    await bot.send_message(
        entity=USER,
        message='[ππΌπΉπ±, ππ΅π’ππͺπ€, π½π€π‘π ππ£π ππ©ππ‘ππ](https://telegra.ph/ππΌπΉπ±-ππ΅π’ππͺπ€-π½π€π‘π-ππ£π-ππ©ππ‘ππ-01-30)',
        parse_mode='md'
    )
    print('OK, message sent (parse_mode=md).')
    # HTML: the URL stays surrogated, so serializing the request raises UnicodeEncodeError.
    await bot.send_message(
        entity=USER,
        message='<a href="https://telegra.ph/ππΌπΉπ±-ππ΅π’ππͺπ€-π½π€π‘π-ππ£π-ππ©ππ‘ππ-01-30">ππΌπΉπ±, ππ΅π’ππͺπ€, π½π€π‘π ππ£π ππ©ππ‘ππ</a>',
        parse_mode='html'
    )
    print('OK, message sent (parse_mode=html).')


if __name__ == '__main__':
    loop.run_until_complete(main())
Traceback
OK, message sent (parse_mode=md).
Traceback (most recent call last):
  File "***/send_msg_telethon.py", line 43, in <module>
    loop.run_until_complete(main())
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "***/send_msg_telethon.py", line 31, in main
    await bot.send_message(
  File "/usr/local/lib/python3.9/dist-packages/telethon/client/messages.py", line 853, in send_message
    result = await self(request)
  File "/usr/local/lib/python3.9/dist-packages/telethon/client/users.py", line 30, in __call__
    return await self._call(self._sender, request, ordered=ordered)
  File "/usr/local/lib/python3.9/dist-packages/telethon/client/users.py", line 58, in _call
    future = sender.send(request, ordered=ordered)
  File "/usr/local/lib/python3.9/dist-packages/telethon/network/mtprotosender.py", line 176, in send
    state = RequestState(request)
  File "/usr/local/lib/python3.9/dist-packages/telethon/network/requeststate.py", line 17, in __init__
    self.data = bytes(request)
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/tlobject.py", line 194, in __bytes__
    return self._bytes()
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/functions/messages.py", line 4667, in _bytes
    b'' if self.entities is None or self.entities is False else b''.join((b'\x15\xc4\xb5\x1c',struct.pack('<i', len(self.entities)),b''.join(x._bytes() for x in self.entities))),
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/functions/messages.py", line 4667, in <genexpr>
    b'' if self.entities is None or self.entities is False else b''.join((b'\x15\xc4\xb5\x1c',struct.pack('<i', len(self.entities)),b''.join(x._bytes() for x in self.entities))),
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/types/__init__.py", line 14821, in _bytes
    self.serialize_bytes(self.url),
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/tlobject.py", line 110, in serialize_bytes
    data = data.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 19-26: surrogates not allowed
Note
v1.24.0
If `parse_mode` is markdown, the URL will be desurrogated (extensions/markdown.py#L128-L131):
result.append(MessageEntityTextUrl(
    offset=m.start(), length=len(m.group(1)),
    url=del_surrogate(m.group(2))
))
If `parse_mode` is html, the URL will remain surrogated.
Thus, if the URL contains characters that need to be surrogated (i.e. Unicode symbols outside the Basic Multilingual Plane) and `parse_mode` is html, a `UnicodeEncodeError` will be raised (`tl.types.MessageEntityTextUrl`):
def _bytes(self):
    return b''.join((
        b"'\xd3\xa6v",
        struct.pack('<i', self.offset),
        struct.pack('<i', self.length),
        self.serialize_bytes(self.url),  # `self.url` will be `.encode('utf-8')` and cause a `UnicodeEncodeError`
    ))
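The failure can be reproduced without Telethon. The sketch below is a standalone re-implementation of the surrogate round trip, written for illustration only: the helper names mirror `add_surrogate` and `del_surrogate`, but their bodies are assumptions and not necessarily the library's exact code, and the sample URL is hypothetical.

```python
import struct


def add_surrogate(text):
    """Replace every non-BMP character with its UTF-16 surrogate pair."""
    return ''.join(
        ''.join(chr(unit) for unit in struct.unpack('<HH', ch.encode('utf-16-le')))
        if ord(ch) >= 0x10000 else ch
        for ch in text
    )


def del_surrogate(text):
    """Merge surrogate pairs back into the characters they represent."""
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


# Hypothetical URL ending in four non-BMP characters.
url = 'https://example.com/\U0001D63D\U0001D664\U0001D661\U0001D659'
surrogated = add_surrogate(url)

try:
    surrogated.encode('utf-8')  # the same call serialize_bytes makes on the URL
    encodable = True
except UnicodeEncodeError:
    encodable = False

restored = del_surrogate(surrogated)
print(encodable, restored == url, len(url), len(surrogated))
```

Under these assumptions, the surrogated string is four code units longer than the original and cannot be encoded as UTF-8 until it has been passed through `del_surrogate`.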
HTTP Bot API
However, the HTTP Bot API handles Unicode symbols properly in both html and markdown parse modes:
https://api.telegram.org/bot<REDACTED>/sendMessage?chat_id=<REDACTED>&text=[ππΌπΉπ±, ππ΅π’ππͺπ€, π½π€π‘π ππ£π ππ©ππ‘ππ](https://telegra.ph/ππΌπΉπ±-ππ΅π’ππͺπ€-π½π€π‘π-ππ£π-ππ©ππ‘ππ-01-30)&parse_mode=markdown
https://api.telegram.org/bot<REDACTED>/sendMessage?chat_id=<REDACTED>&text=<a href="https://telegra.ph/ππΌπΉπ±-ππ΅π’ππͺπ€-π½π€π‘π-ππ£π-ππ©ππ‘ππ-01-30">ππΌπΉπ±, ππ΅π’ππͺπ€, π½π€π‘π ππ£π ππ©ππ‘ππ</a>&parse_mode=html
Response:
{
  "ok": true,
  "result": {
    "message_id": <REDACTED>,
    "from": <REDACTED>,
    "chat": <REDACTED>,
    "date": <REDACTED>,
    "text": "ππΌπΉπ±, ππ΅π’ππͺπ€, π½π€π‘π ππ£π ππ©ππ‘ππ",
    "entities": [
      {
        "offset": 0,
        "length": 52,
        "type": "text_link",
        "url": "https://telegra.ph/ππΌπΉπ±-ππ΅π’ππͺπ€-π½π€π‘π-ππ£π-ππ©ππ‘ππ-01-30"
      }
    ]
  }
}
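The `length` of 52 is consistent with Telegram measuring entity offsets and lengths in UTF-16 code units, which is why surrogating the text makes offset calculation easier. A minimal sketch, assuming the link text consists of 23 non-BMP symbols plus 6 BMP characters (two `, ` separators and two spaces); the stand-in string is hypothetical and only mirrors that composition:

```python
def utf16_length(text):
    """Number of UTF-16 code units needed to represent `text`."""
    return len(text.encode('utf-16-le')) // 2


symbol = '\U0001D63D'  # one non-BMP character, i.e. two UTF-16 code units
# Hypothetical stand-in: words of 4, 6, 4, 3 and 6 symbols (23 in total).
link_text = (
    ', '.join(symbol * n for n in (4, 6))
    + ', '
    + ' '.join(symbol * n for n in (4, 3, 6))
)

print(len(link_text), utf16_length(link_text))
```

Python's `len` counts code points, so it reports fewer units than Telegram expects whenever the text contains non-BMP characters.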
v2
It is unfinished and broken for now, but I patched it a bit to make `_misc.html.parse` and `_misc.markdown.parse` operational.
`_misc.html.parse('<a href="https://telegra.ph/ππΌπΉπ±-ππ΅π’ππͺπ€-π½π€π‘π-ππ£π-ππ©ππ‘ππ-01-30">ππΌπΉπ±, ππ΅π’ππͺπ€, π½π€π‘π ππ£π ππ©ππ‘ππ</a>')`: Well, still broken.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "***/telethon/_misc/html.py", line 132, in parse
    parser.feed(_add_surrogate(html))
  File "/usr/lib/python3.9/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/usr/lib/python3.9/html/parser.py", line 162, in goahead
    self.handle_data(unescape(rawdata[i:j]))
  File "***/telethon/_misc/html.py", line 105, in handle_data
    entity.length += len(text)
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'length'
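The `FrozenInstanceError` indicates that the entity object is a frozen dataclass, so the in-place `entity.length += len(text)` cannot work. A minimal sketch of the failure and one possible workaround using `dataclasses.replace`; the `Entity` class is a hypothetical stand-in, not the actual v2 type:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Entity:  # hypothetical stand-in for a frozen entity type
    offset: int
    length: int


entity = Entity(offset=0, length=0)

try:
    entity.length += 4  # same pattern as in handle_data
    mutated = True
except dataclasses.FrozenInstanceError:
    mutated = False

# Build a new instance instead of mutating the frozen one.
entity = dataclasses.replace(entity, length=entity.length + 4)
print(mutated, entity)
```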
`_misc.markdown.parse('[ππΌπΉπ±, ππ΅π’ππͺπ€, π½π€π‘π ππ£π ππ©ππ‘ππ](https://telegra.ph/ππΌπΉπ±-ππ΅π’ππͺπ€-π½π€π‘π-ππ£π-ππ©ππ‘ππ-01-30)')`: Oops, it even raises a `UnicodeEncodeError` in the parsing stage (for v1.24.0, it is raised in the request-constructing stage).
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "***/telethon/_misc/markdown.py", line 76, in parse
    parsed = MARKDOWN.parse(add_surrogate(message.strip()))
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/main.py", line 260, in parse
    self.core.process(state)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/parser_core.py", line 33, in process
    rule(state)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/rules_core/inline.py", line 10, in inline
    state.md.inline.parse(token.content, state.md, state.env, token.children)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/parser_inline.py", line 120, in parse
    self.tokenize(state)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/parser_inline.py", line 102, in tokenize
    ok = rule(state, False)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/rules_inline/link.py", line 54, in link
    href = state.md.normalizeLink(res.str)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/main.py", line 331, in normalizeLink
    return normalize_url.normalizeLink(url)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/common/normalize_url.py", line 36, in normalizeLink
    return mdurl.encode(mdurl.format(parsed))
  File "/usr/local/lib/python3.9/dist-packages/mdurl/_encode.py", line 72, in encode
    result += encode_uri_component(string[i] + string[i + 1])
  File "/usr/lib/python3.9/urllib/parse.py", line 856, in quote
    string = string.encode(encoding, errors)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
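The last frames show the surrogated link reaching `urllib.parse.quote`, which encodes its input as UTF-8 and therefore rejects lone surrogates. A small sketch of that behavior outside of markdown-it, using a single hypothetical character:

```python
from urllib.parse import quote

char = '\U0001D63D'          # one non-BMP character
surrogated = '\ud835\ude3d'  # the same character written as a UTF-16 surrogate pair

try:
    quote(surrogated)
    quoted_surrogates = True
except UnicodeEncodeError:
    quoted_surrogates = False

percent_encoded = quote(char)  # works once the text is desurrogated
print(quoted_surrogates, percent_encoded)
```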
Conclusion
I know this is invalid html, and even in markdown, un-urlencoded URLs should be avoided, but:
- A `MessageEntityTextUrl` with an un-urlencoded URL is valid, and official Telegram clients handle it properly.
- The Telegram Bot API can handle such html/markdown too.
- `parse_mode` is just a convenience tool built on top of formatting entities. If the html/markdown has no syntax errors and can produce valid entities, we should accept it.

Telethon always surrogates the whole string before parsing it, to make calculating Telegram offsets easier. This causes the content of entities to be surrogated too. This issue is about `MessageEntityTextUrl`, but more potential bugs may exist.
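For the html case, one possible direction is to desurrogate the `href` value when the entity is created, as the markdown parser already does. The following is a simplified standalone sketch built on the standard-library `HTMLParser`, not Telethon's actual parser: `LinkParser`, `parse`, and the dict-based entities are hypothetical names used for illustration, and the surrogate helpers are re-implemented locally as assumptions.

```python
import struct
from html.parser import HTMLParser


def add_surrogate(text):
    """Replace every non-BMP character with its UTF-16 surrogate pair."""
    return ''.join(
        ''.join(chr(unit) for unit in struct.unpack('<HH', ch.encode('utf-16-le')))
        if ord(ch) >= 0x10000 else ch
        for ch in text
    )


def del_surrogate(text):
    """Merge surrogate pairs back into the characters they represent."""
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


class LinkParser(HTMLParser):
    """Collects plain text and text-link entities from surrogated html."""

    def __init__(self):
        super().__init__()
        self.text = ''
        self.entities = []
        self._open = None

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if tag == 'a' and href:
            # Offsets stay in UTF-16 units; only the URL is desurrogated.
            self._open = {'offset': len(self.text), 'length': 0,
                          'url': del_surrogate(href)}

    def handle_data(self, data):
        self.text += data
        if self._open is not None:
            self._open['length'] += len(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._open is not None:
            self.entities.append(self._open)
            self._open = None


def parse(html):
    parser = LinkParser()
    parser.feed(add_surrogate(html))
    parser.close()
    return del_surrogate(parser.text), parser.entities


label = '\U0001D63D\U0001D664\U0001D661\U0001D659'  # hypothetical link text
text, entities = parse(f'<a href="https://example.com/{label}">{label}</a>')
print(text == label, entities[0]['length'], entities[0]['url'].encode('utf-8'))
```

The entity length is still counted in UTF-16 code units, while the URL stored on the entity can be encoded as UTF-8 without error.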