The URL of `MessageEntityTextUrl` is not desurrogated if `parse_mode` is html

Open • Rongronggg9 opened this issue 2 years ago • 0 comments

Checklist

  • [x] The error is in the library's code, and not in my own.
  • [x] I have searched for this issue before posting it and there isn't a duplicate.
  • [ ] I ran pip install -U https://github.com/LonamiWebs/Telethon/archive/master.zip and triggered the bug in the latest version. (Since v2 is unfinished, it appears broken and cannot run. I did manage to run some tests; please read the last two sections.)

Code that causes the issue

import asyncio
from telethon import TelegramClient

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

# API_ID, API_HASH, PROXY, TOKEN and USER are placeholders; define them before running.
bot = TelegramClient('bot', API_ID, API_HASH, proxy=PROXY).start(bot_token=TOKEN)


async def main():
    await bot.send_message(
        entity=USER,
        message='[𝗕𝗼𝗹𝗱, 𝘐𝘵𝘢𝘭𝘪𝘤, 𝘽𝙤𝙡𝙙 𝙖𝙣𝙙 𝙞𝙩𝙖𝙡𝙞𝙘](https://telegra.ph/𝗕𝗼𝗹𝗱-𝘐𝘵𝘢𝘭𝘪𝘤-𝘽𝙤𝙡𝙙-𝙖𝙣𝙙-𝙞𝙩𝙖𝙡𝙞𝙘-01-30)',
        parse_mode='md'
    )
    print('OK, message sent (parse_mode=md).')
    await bot.send_message(
        entity=USER,
        message='<a href="https://telegra.ph/𝗕𝗼𝗹𝗱-𝘐𝘵𝘢𝘭𝘪𝘤-𝘽𝙤𝙡𝙙-𝙖𝙣𝙙-𝙞𝙩𝙖𝙡𝙞𝙘-01-30">𝗕𝗼𝗹𝗱, 𝘐𝘵𝘢𝘭𝘪𝘤, 𝘽𝙤𝙡𝙙 𝙖𝙣𝙙 𝙞𝙩𝙖𝙡𝙞𝙘</a>',
        parse_mode='html'
    )
    print('OK, message sent (parse_mode=html).')


if __name__ == '__main__':
    loop.run_until_complete(main())

Traceback

OK, message sent (parse_mode=md).
Traceback (most recent call last):
  File "***/send_msg_telethon.py", line 43, in <module>
    loop.run_until_complete(main())
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "***/send_msg_telethon.py", line 31, in main
    await bot.send_message(
  File "/usr/local/lib/python3.9/dist-packages/telethon/client/messages.py", line 853, in send_message
    result = await self(request)
  File "/usr/local/lib/python3.9/dist-packages/telethon/client/users.py", line 30, in __call__
    return await self._call(self._sender, request, ordered=ordered)
  File "/usr/local/lib/python3.9/dist-packages/telethon/client/users.py", line 58, in _call
    future = sender.send(request, ordered=ordered)
  File "/usr/local/lib/python3.9/dist-packages/telethon/network/mtprotosender.py", line 176, in send
    state = RequestState(request)
  File "/usr/local/lib/python3.9/dist-packages/telethon/network/requeststate.py", line 17, in __init__
    self.data = bytes(request)
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/tlobject.py", line 194, in __bytes__
    return self._bytes()
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/functions/messages.py", line 4667, in _bytes
    b'' if self.entities is None or self.entities is False else b''.join((b'\x15\xc4\xb5\x1c',struct.pack('<i', len(self.entities)),b''.join(x._bytes() for x in self.entities))),
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/functions/messages.py", line 4667, in <genexpr>
    b'' if self.entities is None or self.entities is False else b''.join((b'\x15\xc4\xb5\x1c',struct.pack('<i', len(self.entities)),b''.join(x._bytes() for x in self.entities))),
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/types/__init__.py", line 14821, in _bytes
    self.serialize_bytes(self.url),
  File "/usr/local/lib/python3.9/dist-packages/telethon/tl/tlobject.py", line 110, in serialize_bytes
    data = data.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 19-26: surrogates not allowed

Note

v1.24.0

If parse_mode is markdown, the URL will be desurrogated (extensions/markdown.py#L128-L131):

                result.append(MessageEntityTextUrl(
                    offset=m.start(), length=len(m.group(1)),
                    url=del_surrogate(m.group(2))
                ))

If parse_mode is html, the URL will remain surrogated.

Thus, if the URL contains characters that need to be surrogated (e.g. some Unicode symbols) and parse_mode is html, a UnicodeEncodeError will be raised (tl.types.MessageEntityTextUrl):

    def _bytes(self):
        return b''.join((
            b"'\xd3\xa6v",
            struct.pack('<i', self.offset),
            struct.pack('<i', self.length),
            self.serialize_bytes(self.url),  # `self.url` is passed through `.encode('utf-8')` here, which raises `UnicodeEncodeError`
        ))
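
The failure can be reproduced without Telethon. The following is a minimal sketch, not the library's code: `add_surrogate` and `del_surrogate` here are standalone helpers written for illustration (assumed to behave like the helpers of the same name referenced above), and the URL is a shortened stand-in for the one in this report.

```python
import struct


def add_surrogate(text: str) -> str:
    """Replace each non-BMP character with its two UTF-16 surrogate code units."""
    parts = []
    for ch in text:
        if ord(ch) >= 0x10000:
            high, low = struct.unpack('<HH', ch.encode('utf-16-le'))
            parts.append(chr(high) + chr(low))
        else:
            parts.append(ch)
    return ''.join(parts)


def del_surrogate(text: str) -> str:
    """Merge surrogate pairs back into the characters they represent."""
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


# U+1D5D5 is the sans-serif bold "B" used in the URL of this report.
url = 'https://telegra.ph/\U0001D5D5-01-30'
surrogated = add_surrogate(url)

try:
    surrogated.encode('utf-8')
    encode_failed = False
except UnicodeEncodeError:
    # UTF-8 refuses lone surrogate code points: "surrogates not allowed".
    encode_failed = True

restored = del_surrogate(surrogated)
print(encode_failed, restored == url, restored.encode('utf-8'))
```

The surrogated string is one item longer than the original and cannot be encoded, while the desurrogated string encodes normally. This is why the markdown path, which desurrogates the URL, succeeds and the html path does not.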

HTTP Bot API

However, the HTTP Bot API handles Unicode symbols properly in both html and markdown parse modes:

https://api.telegram.org/bot<REDACTED>/sendMessage?chat_id=<REDACTED>&text=[𝗕𝗼𝗹𝗱, 𝘐𝘵𝘢𝘭𝘪𝘤, 𝘽𝙤𝙡𝙙 𝙖𝙣𝙙 𝙞𝙩𝙖𝙡𝙞𝙘](https://telegra.ph/𝗕𝗼𝗹𝗱-𝘐𝘵𝘢𝘭𝘪𝘤-𝘽𝙤𝙡𝙙-𝙖𝙣𝙙-𝙞𝙩𝙖𝙡𝙞𝙘-01-30)&parse_mode=markdown
https://api.telegram.org/bot<REDACTED>/sendMessage?chat_id=<REDACTED>&text=<a href="https://telegra.ph/𝗕𝗼𝗹𝗱-𝘐𝘵𝘢𝘭𝘪𝘤-𝘽𝙤𝙡𝙙-𝙖𝙣𝙙-𝙞𝙩𝙖𝙡𝙞𝙘-01-30">𝗕𝗼𝗹𝗱, 𝘐𝘵𝘢𝘭𝘪𝘤, 𝘽𝙤𝙡𝙙 𝙖𝙣𝙙 𝙞𝙩𝙖𝙡𝙞𝙘</a>&parse_mode=html

Response:

{
  "ok": true,
  "result": {
    "message_id": <REDACTED>,
    "from": <REDACTED>,
    "chat": <REDACTED>,
    "date": <REDACTED>,
    "text": "𝗕𝗼𝗹𝗱, 𝘐𝘵𝘢𝘭𝘪𝘤, 𝘽𝙤𝙡𝙙 𝙖𝙣𝙙 𝙞𝙩𝙖𝙡𝙞𝙘",
    "entities": [
      {
        "offset": 0,
        "length": 52,
        "type": "text_link",
        "url": "https://telegra.ph/𝗕𝗼𝗹𝗱-𝘐𝘵𝘢𝘭𝘪𝘤-𝘽𝙤𝙡𝙙-𝙖𝙣𝙙-𝙞𝙩𝙖𝙡𝙞𝙘-01-30"
      }
    ]
  }
}
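
The entity `length` of 52 in this response is consistent with Telegram counting offsets and lengths in UTF-16 code units rather than in code points. A small sketch to check this, where `utf16_len` is a hypothetical helper name introduced only for illustration:

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to represent `text`."""
    return len(text.encode('utf-16-le')) // 2


# The message text from the response above: 23 non-BMP letters plus
# 6 ordinary characters (commas and spaces).
text = '𝗕𝗼𝗹𝗱, 𝘐𝘵𝘢𝘭𝘪𝘤, 𝘽𝙤𝙡𝙙 𝙖𝙣𝙙 𝙞𝙩𝙖𝙡𝙞𝙘'

# First number: length in code points; second: length in UTF-16 code units.
print(len(text), utf16_len(text))
```

Each styled letter lies outside the Basic Multilingual Plane and therefore takes two code units, which is the reason for surrogating the text before computing offsets.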

v2

It is currently unfinished and broken, but I patched it a bit to make _misc.html.parse and _misc.markdown.parse operational.

_misc.html.parse('<a href="https://telegra.ph/𝗕𝗼𝗹𝗱-𝘐𝘵𝘢𝘭𝘪𝘤-𝘽𝙤𝙡𝙙-𝙖𝙣𝙙-𝙞𝙩𝙖𝙡𝙞𝙘-01-30">𝗕𝗼𝗹𝗱, 𝘐𝘵𝘢𝘭𝘪𝘤, 𝘽𝙤𝙡𝙙 𝙖𝙣𝙙 𝙞𝙩𝙖𝙡𝙞𝙘</a>'): Well, still broken.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "***/telethon/_misc/html.py", line 132, in parse
    parser.feed(_add_surrogate(html))
  File "/usr/lib/python3.9/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/usr/lib/python3.9/html/parser.py", line 162, in goahead
    self.handle_data(unescape(rawdata[i:j]))
  File "***/telethon/_misc/html.py", line 105, in handle_data
    entity.length += len(text)
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'length'

_misc.markdown.parse('[𝗕𝗼𝗹𝗱, 𝘐𝘵𝘢𝘭𝘪𝘤, 𝘽𝙤𝙡𝙙 𝙖𝙣𝙙 𝙞𝙩𝙖𝙡𝙞𝙘](https://telegra.ph/𝗕𝗼𝗹𝗱-𝘐𝘵𝘢𝘭𝘪𝘤-𝘽𝙤𝙡𝙙-𝙖𝙣𝙙-𝙞𝙩𝙖𝙡𝙞𝙘-01-30)'): Oops, it even raises a UnicodeEncodeError at the parsing stage (in v1.24.0, it is raised at the request-construction stage).

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "***/telethon/_misc/markdown.py", line 76, in parse
    parsed = MARKDOWN.parse(add_surrogate(message.strip()))
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/main.py", line 260, in parse
    self.core.process(state)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/parser_core.py", line 33, in process
    rule(state)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/rules_core/inline.py", line 10, in inline
    state.md.inline.parse(token.content, state.md, state.env, token.children)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/parser_inline.py", line 120, in parse
    self.tokenize(state)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/parser_inline.py", line 102, in tokenize
    ok = rule(state, False)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/rules_inline/link.py", line 54, in link
    href = state.md.normalizeLink(res.str)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/main.py", line 331, in normalizeLink
    return normalize_url.normalizeLink(url)
  File "/usr/local/lib/python3.9/dist-packages/markdown_it/common/normalize_url.py", line 36, in normalizeLink
    return mdurl.encode(mdurl.format(parsed))
  File "/usr/local/lib/python3.9/dist-packages/mdurl/_encode.py", line 72, in encode
    result += encode_uri_component(string[i] + string[i + 1])
  File "/usr/lib/python3.9/urllib/parse.py", line 856, in quote
    string = string.encode(encoding, errors)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
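
The last frame of this traceback can be reproduced in isolation with the standard library alone. This is only a sketch of that final step, using a single character (U+1D5D5, the bold "B" from the URL) written once as a surrogate pair and once as a real code point:

```python
from urllib.parse import quote

surrogated = '\ud835\uddd5'  # U+1D5D5 as two separate surrogate items, as after surrogating
real_char = '\U0001D5D5'     # the same character as a single code point

try:
    quote(surrogated)
    quote_failed = False
except UnicodeEncodeError:
    # quote() encodes str input as UTF-8 first, which rejects surrogates.
    quote_failed = True

encoded = quote(real_char)
print(quote_failed, encoded)
```

Percent-encoding works on the real character and fails on the surrogate pair, so any URL normalization that runs on an already surrogated string will hit this error.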

Conclusion

I know this is invalid HTML, and even in markdown an un-urlencoded URL should be avoided, but:

  1. A MessageEntityTextUrl with an un-urlencoded URL is valid, and official Telegram clients handle it properly.
  2. The Telegram Bot API can handle such html/markdown too.
  3. parse_mode is just a convenient tool built on top of formatting entities. If the html/markdown has no syntax errors and can produce valid entities, we should accept it.

Telethon always surrogates the whole string before parsing it, to make calculating Telegram offsets easier. This causes the content of entities to be surrogated too. This issue is about MessageEntityTextUrl, but more bugs of this kind may exist.
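
One possible direction for a fix, sketched under assumptions: keep surrogating the input so that offsets count UTF-16 code units, but desurrogate any string stored inside an entity. The `LinkParser` class and `parse` function below are hypothetical and heavily simplified (only `<a href>` is handled, entities are plain dicts); they are not Telethon's parser.

```python
import struct
from html.parser import HTMLParser


def add_surrogate(text: str) -> str:
    """Replace each non-BMP character with its two UTF-16 surrogate code units."""
    return ''.join(
        ''.join(chr(u) for u in struct.unpack('<HH', ch.encode('utf-16-le')))
        if ord(ch) >= 0x10000 else ch
        for ch in text
    )


def del_surrogate(text: str) -> str:
    """Merge surrogate pairs back into the characters they represent."""
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


class LinkParser(HTMLParser):
    """Hypothetical stand-in for an HTML-to-entities parser (links only)."""

    def __init__(self):
        super().__init__()
        self.text = ''      # surrogated text, so len() counts UTF-16 units
        self.links = []
        self._open = None

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if tag == 'a' and href:
            # Offsets stay in UTF-16 units, but the URL itself is desurrogated.
            self._open = {'offset': len(self.text), 'length': 0,
                          'url': del_surrogate(href)}

    def handle_data(self, data):
        self.text += data
        if self._open is not None:
            self._open['length'] += len(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._open is not None:
            self.links.append(self._open)
            self._open = None


def parse(html: str):
    parser = LinkParser()
    parser.feed(add_surrogate(html))
    parser.close()
    return del_surrogate(parser.text), parser.links


html = '<a href="https://telegra.ph/\U0001D5D5-01-30">\U0001D5D5old</a>'
text, links = parse(html)
print(links[0]['offset'], links[0]['length'], ascii(links[0]['url']))
```

With this arrangement the entity length still counts the non-BMP letter as two units, while the URL is an ordinary string that `.encode('utf-8')` accepts.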

Rongronggg9, Jan 30 '22 09:01