Message.Entities returns Length of UTF16 encoded string, not UTF8 supported by Golang

Open Fef0 opened this issue 5 years ago • 3 comments

How I discovered it

I wanted to extract the text (including emoji) that carried a particular link, but I always got the right Offset with the wrong Length: the value was correct for UTF-16, but not for my original UTF-8 string. Telegram reports Offset and Length in UTF-16 code units, whereas Go strings are UTF-8 and are sliced by byte. With pure ASCII text there is no problem at all, since every ASCII character is one byte in UTF-8 and one code unit in UTF-16. Once an emoji is used the two counts diverge, because an emoji takes a different number of units in each encoding (for example, ➡ is three bytes in UTF-8 but a single UTF-16 code unit), and byte-based slicing picks the wrong range.
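To make the mismatch concrete, here is a small sketch that measures the same string three ways. The `lengths` helper is a hypothetical name used only for illustration; it is not part of telegram-bot-api. The emoji are written as escape sequences so the variation selector (U+FE0F) is visible.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// lengths is a hypothetical helper: it reports the size of s in UTF-8 bytes,
// Unicode code points, and UTF-16 code units.
func lengths(s string) (bytes, runes, units int) {
	return len(s), utf8.RuneCountInString(s), len(utf16.Encode([]rune(s)))
}

func main() {
	samples := []struct{ label, text string }{
		{"ascii", "Click Me"},
		// U+27A1 U+FE0F ... U+2B05 U+FE0F, the text used in this issue.
		{"arrows", "\u27a1\ufe0fClick Me\u2b05\ufe0f"},
		// A code point outside the Basic Multilingual Plane needs a surrogate pair.
		{"grinning face", "\U0001F600"},
	}
	for _, s := range samples {
		b, r, u := lengths(s.text)
		fmt.Printf("%s: %d bytes, %d code points, %d UTF-16 units\n", s.label, b, r, u)
	}
}
```

For the ASCII sample all three counts are 8. For the arrow sample the string is 20 bytes long but only 12 UTF-16 units, which is the Length Telegram reports. The last sample shows that counting code points is not a substitute either: one code point, two UTF-16 units.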

How I solved this particular problem

I used the unicode/utf16 package to encode the original text as UTF-16, slice out the part I wanted, and then convert it back to a UTF-8 string.

The Code

Given update of type Update, I wanted to extract each piece of text with an embedded link by using the Entities field. The original message was "➡️Click Me⬅️ or ➡️Click Me⬅️" with "https://www.example.com/" embedded in both occurrences (just as a test).

Not Working Code

Using the following code (not using unicode/utf16):

fmt.Println(*update.ChannelPost.Entities)
for _, e := range *update.ChannelPost.Entities {
	// Get the whole update Text
	str := update.ChannelPost.Text
	// Get the text I need
	str = str[e.Offset : e.Offset+e.Length]
	fmt.Println(str)
}

Output

[{text_link 0 12 https://www.example.com/ <nil>} {text_link 16 12 https://www.example.com/ <nil>}]
➡️Click 
�️ or ➡�

As you can see, the first element is cut short (the trailing "Me⬅️" is missing), while the second element starts and ends in the middle of a multi-byte character and comes out broken.

Working Code

The following code works for this message (using unicode/utf16):

fmt.Println(*update.ChannelPost.Entities)
// For each entity
for _, e := range *update.ChannelPost.Entities {
	// Get the whole update Text
	str := update.ChannelPost.Text
	// Encode it into utf16
	utfEncodedString := utf16.Encode([]rune(str))
	// Decode just the piece of string I need
	runeString := utf16.Decode(utfEncodedString[e.Offset : e.Offset+e.Length])
	// Transform []rune into string
	str = string(runeString)
	fmt.Println(str)
}

Output

[{text_link 0 12 https://www.example.com/ <nil>} {text_link 16 12 https://www.example.com/ <nil>}]
➡️Click Me⬅️
➡️Click Me⬅️

Elements are just as they should be.
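The same steps can be wrapped in a reusable function. This is only a sketch: `entityText` is a hypothetical helper, not something telegram-bot-api provides, and it takes plain integers rather than the library's entity type so that it runs on its own. It also adds a bounds check, which the snippet above does not have, so a malformed entity yields an error instead of a panic.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf16"
)

// entityText (hypothetical helper) returns the part of text covered by an
// entity whose offset and length are expressed in UTF-16 code units.
func entityText(text string, offset, length int) (string, error) {
	// Re-encode the UTF-8 string as UTF-16 so the indices line up.
	units := utf16.Encode([]rune(text))
	if offset < 0 || length < 0 || offset+length > len(units) {
		return "", errors.New("entity range is outside the text")
	}
	// Decode only the requested slice back into a Go (UTF-8) string.
	return string(utf16.Decode(units[offset : offset+length])), nil
}

func main() {
	// The message from this issue, with emoji written as escapes.
	text := "\u27a1\ufe0fClick Me\u2b05\ufe0f or \u27a1\ufe0fClick Me\u2b05\ufe0f"
	// Offsets and lengths as reported in the output above.
	for _, e := range []struct{ Offset, Length int }{{0, 12}, {16, 12}} {
		s, err := entityText(text, e.Offset, e.Length)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s)
	}
}
```

With the library, the call would presumably look like `entityText(update.ChannelPost.Text, e.Offset, e.Length)` inside the loop over entities.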

Conclusion

As you can see, the Offset and Length values are identical in both runs and are correct once the text is treated as UTF-16. I hope this helps anyone who runs into the same issue!
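An alternative that avoids re-encoding the whole message is to translate the UTF-16 positions into byte positions and then slice the original string. The sketch below is an assumption on my part rather than anything the library offers; `utf16ToByteOffset` is a hypothetical name. It walks the string once and counts two units for every code point above U+FFFF, one otherwise.

```go
package main

import "fmt"

// utf16ToByteOffset (hypothetical helper) returns the byte index in s that
// corresponds to the given UTF-16 code unit offset. It returns -1 if the
// offset is out of range or falls in the middle of a surrogate pair.
func utf16ToByteOffset(s string, unitOffset int) int {
	units := 0
	for i, r := range s {
		if units == unitOffset {
			return i
		}
		if units > unitOffset {
			return -1
		}
		if r >= 0x10000 {
			units += 2 // encoded as a surrogate pair in UTF-16
		} else {
			units++
		}
	}
	if units == unitOffset {
		return len(s) // the offset points just past the end of the string
	}
	return -1
}

func main() {
	text := "\u27a1\ufe0fClick Me\u2b05\ufe0f or \u27a1\ufe0fClick Me\u2b05\ufe0f"
	// Second entity from this issue: Offset 16, Length 12.
	start := utf16ToByteOffset(text, 16)
	end := utf16ToByteOffset(text, 16+12)
	if start < 0 || end < 0 {
		fmt.Println("entity range does not match the text")
		return
	}
	fmt.Println(text[start:end])
}
```

Whether this is worth it over the utf16.Encode approach depends on message size and how many entities are processed; for short messages the simpler version is probably fine.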

Fef0, May 03 '19 00:05

If you want to convert entities ([]tgbotapi.MessageEntity) to Markdown text, here is an example for the Telebot library. And here is an example for telegram-bot-api that converts entities to Discord markdown (with a test).

shtrih, Nov 25 '21 16:11

Discord

kingUFU, Nov 09 '22 11:11

The issue is also present internally, in the CommandWithAt() function (types.go:681).

Pato05, Jan 29 '23 21:01