
Detect URL in caption?

Open sefidpardazesh opened this issue 8 years ago • 15 comments

In the Telegram Bot API, text messages carry entity types that let us detect url, mention and text_mention. But for a photo or video with a caption, how do we detect a URL or mention? In other words, how can we use entity types in the caption of a photo or video?

sefidpardazesh avatar Jun 21 '17 09:06 sefidpardazesh

Entities are there only for cases when updating messages (that are either HTML-formatted or use Markdown), so that they can be reformatted properly.

There is no such thing for captions; you will have to write a regex for this...
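As a rough illustration of the regex route, here is a minimal sketch. The function name `findCaptionEntities()` and both patterns are assumptions made for this example, not part of the library; the mention pattern assumes usernames of 5 to 32 characters made of letters, digits and underscores. A text_mention (a user without a username) cannot be recovered this way, since it is not visible in the plain caption text.

```php
<?php
// Hypothetical helper: pulls URLs and @mentions out of a caption string.
function findCaptionEntities(string $caption): array
{
    // http(s) URLs, up to the next whitespace, quote or angle bracket.
    preg_match_all('~\bhttps?://[^\s<>"]+~iu', $caption, $urls);

    // @username that is not part of an e-mail address or a longer word.
    preg_match_all('~(?<![\w@])@([A-Za-z][A-Za-z0-9_]{4,31})(?!\w)~', $caption, $mentions);

    return ['urls' => $urls[0], 'mentions' => $mentions[0]];
}

print_r(findCaptionEntities('Photo by @some_user, see https://example.com/a?b=1 for more'));
```

Note that trailing punctuation directly after a URL (such as a full stop) would be captured as part of it, so real captions may need a stricter pattern.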

jacklul avatar Jun 21 '17 10:06 jacklul

Thanks. What is the regex for mention and text_mention?

sefidpardazesh avatar Jun 21 '17 10:06 sefidpardazesh

> Entities are there only for cases when updating messages (that are either HTML-formatted or use Markdown), so that they can be reformatted properly.

@jacklul I'm trying to reformat an edited message, but without success. How can I use the entities to properly reformat?

KilluaFein avatar Jul 13 '17 10:07 KilluaFein

@KilluaFein proof of concept:

    private function parseEntitiesString($text, $entities)
    {
        $global_incr = 0;
        foreach ($entities as $entity) {
            if ($entity->getType() == 'italic') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '_', $start, 0);
                $text = $this->mb_substr_replace($text, '_', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'bold') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '*', $start, 0);
                $text = $this->mb_substr_replace($text, '*', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'code') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '`', $start, 0);
                $text = $this->mb_substr_replace($text, '`', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'pre') {
                $start = $global_incr + $entity->getOffset();
                $end = 3 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '```', $start, 0);
                $text = $this->mb_substr_replace($text, '```', $end, 0);

                $global_incr = $global_incr + 6;
            } elseif ($entity->getType() == 'text_link') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();
                $url = '(' . $entity->getUrl() . ')';

                $text = $this->mb_substr_replace($text, '[', $start, 0);
                $text = $this->mb_substr_replace($text, ']' . $url, $end, 0);

                $global_incr = $global_incr + 2 + mb_strlen($url);
            } elseif ($entity->getType() == 'code') {
                // NOTE: unreachable as written, since 'code' is already handled above.
                $start = $global_incr + $entity->getOffset();

                $text = mb_substr($text, 0, $start);
            }
        }

        return $text;
    }

I never managed to make it work in 100% of cases. Multibyte characters break the offsets.

jacklul avatar Jul 13 '17 14:07 jacklul

> Multibyte characters break offsets.

Like emoji, right?

And what is mb_substr_replace()?

KilluaFein avatar Jul 13 '17 16:07 KilluaFein

offset and length are counted in UTF-16 code units; maybe converting between UTF-16 and UTF-8 is a way to solve this?
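To make the mismatch concrete, here is a small sketch comparing the character count that the mb_* functions see with the UTF-16 code unit count that entity offsets are based on. `utf16Length()` is a name invented for this example.

```php
<?php
// Hypothetical helper: number of UTF-16 code units in a UTF-8 string.
function utf16Length(string $text): int
{
    // Every UTF-16 code unit is 2 bytes, so halve the byte length.
    return intdiv(strlen(mb_convert_encoding($text, 'UTF-16LE', 'UTF-8')), 2);
}

$text = "Hi \u{1F600} there"; // contains one emoji outside the BMP

echo mb_strlen($text, 'UTF-8'), "\n"; // 10 characters
echo utf16Length($text), "\n";        // 11 UTF-16 code units
```

Every character outside the Basic Multilingual Plane (most emojis) adds one extra code unit, which is why offsets drift after each emoji.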

KilluaFein avatar Jul 13 '17 16:07 KilluaFein

mb_XXX functions are for multibyte strings (hence the mb, I guess).

I spent a lot of time thinking about this and NEVER found a solution that gets it to work properly.
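For reference, the helper called in the proof of concept is not shown in this thread. Something along these lines would fit the way it is called; this is only a guess at its behaviour, and the name `mbSubstrReplace()` is made up here. It only handles non-negative `$start` and `$length` values.

```php
<?php
// Hypothetical stand-in for the helper used above: a character-aware
// variant of substr_replace() built from mb_substr().
function mbSubstrReplace(string $string, string $replacement, int $start, int $length = 0): string
{
    return mb_substr($string, 0, $start, 'UTF-8')
        . $replacement
        . mb_substr($string, $start + $length, null, 'UTF-8');
}

echo mbSubstrReplace('héllo', '_', 0), "\n"; // insert at the start
echo mbSubstrReplace('héllo', '_', 5), "\n"; // insert at the end
```

With a length of 0 it inserts without removing anything, which is how the proof of concept adds its Markdown markers.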

jacklul avatar Jul 13 '17 21:07 jacklul

    public static function processEntities (string $_text, array $_message_raw): string
    {
        $preset = [
            'bold'      => '<b>%text</b>',
            'italic'    => '<i>%text</i>',
            'text_link' => '<a href="%url">%text</a>',
            'code'      => '<code>%text</code>',
            'pre'       => '<pre>%text</pre>',
        ];

        if (!isset ($_message_raw['entities']))
        {
            return $_text;
        }

        $iterationText = $_text;
        $globalDiff    = 0;
        foreach ($_message_raw['entities'] as $entity)
        {
            $type   = $entity['type'];
            $offset = $entity['offset'] + $globalDiff;
            $length = $entity['length'];

            $pBefore = \mb_substr ($iterationText, 0, $offset);
            $pText   = \mb_substr ($iterationText, $offset, $length);
            $pAfter  = \mb_substr ($iterationText, ($offset + $length));

            // Note: str_replace() works fine with UTF-8 in recent PHP versions.
            if (isset ($preset[$type]))
            {
                // Get pattern from the preset.
                $replacedContent = $preset[$type];

                // Replace the URL first, for the rare case where the text itself contains a macro.
                if (!empty ($entity['url']))
                {
                    $replacedContent = \str_replace ('%url', $entity['url'], $replacedContent);
                }

                // Replace main text.
                $replacedContent = \str_replace ('%text', $pText, $replacedContent);

                $newText       = $pBefore . $replacedContent . $pAfter;
                $globalDiff    += (\mb_strlen ($newText) - \mb_strlen ($iterationText));
                $iterationText = $newText;
            }
        }

        return $iterationText;
    }

f77 avatar Mar 13 '18 20:03 f77

@jacklul what is the actual problem, and how can it be reproduced?

akalongman avatar May 10 '18 18:05 akalongman

I believe the point of this issue is to have a way to edit and reformat messages using the entities field. Because the message text does not contain formatting, we have to use the 'entities' field for that. I never managed to create a function that could parse it and apply it to the message string correctly, because of multibyte strings...

One of the simplest examples would be a button under a message that removes or adds text to the message while keeping the message contents (and that content cannot be obtained or generated in any other way than by grabbing it from the Message object).
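One way to sidestep the multibyte problem, sketched here under the assumption that offset and length are UTF-16 code units, is to do the slicing on a UTF-16 copy of the text and convert the slice back. `entityText()` is a hypothetical name, not a library function.

```php
<?php
// Hypothetical helper: extracts the text an entity covers, given its
// offset and length in UTF-16 code units.
function entityText(string $text, int $offset, int $length): string
{
    // In UTF-16LE every code unit is exactly 2 bytes, so plain substr()
    // with doubled values cuts at the right positions.
    $utf16 = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');
    $slice = substr($utf16, $offset * 2, $length * 2);

    return mb_convert_encoding($slice, 'UTF-8', 'UTF-16LE');
}

// The emoji takes 2 code units and the space 1, so "bold" starts at 3.
echo entityText("\u{1F600} bold text", 3, 4), "\n";
```

The same idea could be used for inserting markup: work on the UTF-16 copy throughout and convert back to UTF-8 once at the end.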

jacklul avatar May 10 '18 19:05 jacklul

Any news on this issue? Emojis + text formatting using entities info (offset, length)

ParachainsDev avatar Dec 05 '19 01:12 ParachainsDev

I have a working version (I think), needs some further testing and then I'll release it 👍

noplanman avatar Dec 08 '19 10:12 noplanman

My latest experiment, which I'll pack into a small package when it works 100%.

Try the class below, and use it like:

$entity_decoder = new EntityDecoder($message, 'markdown'); // or 'html'
$decoded_text   = $entity_decoder->decode();

<?php

use Longman\TelegramBot\Entities\Message;
use Longman\TelegramBot\Entities\MessageEntity;

class EntityDecoder
{
    private $entities;
    private $text;
    private $style;
    private $without_cmd;
    private $offset_correction = 0;

    /**
     * @param Message $message     Message object to reconstruct Entities from.
     * @param string  $style       Either 'html' or 'markdown'.
     * @param bool    $without_cmd If the bot command should be included or not.
     */
    public function __construct(Message $message, string $style = 'html', bool $without_cmd = false)
    {
        $this->entities    = $message->getEntities();
        $this->text        = $message->getText($without_cmd);
        $this->style       = $style;
        $this->without_cmd = $without_cmd;
    }

    public function decode(): string
    {
        if (empty($this->entities)) {
            return $this->text;
        }

        $this->fixBotCommandEntity();

        // Reverse entities and start replacing bits from the back, to preserve offset positions.
        foreach (array_reverse($this->entities) as $entity) {
            $this->text = $this->decodeEntity($entity, $this->text);
        }

        return $this->text;
    }

    protected function fixBotCommandEntity(): void
    {
        // First entity would be the bot command, remove if necessary.
        $first_entity = reset($this->entities);
        if ($this->without_cmd && $first_entity->getType() === 'bot_command') {
            $this->offset_correction = ($first_entity->getLength() + 1);
            array_shift($this->entities);
        }
    }

    /**
     * @param MessageEntity $entity
     *
     * @return array
     */
    protected function getOffsetAndLength(MessageEntity $entity): array
    {
        // Split the current text into Unicode characters (code points).
        // https://www.php.net/manual/en/function.str-split.php#115703
        $chars = preg_split('/(.)/us', $this->text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

        // Entity offset and length are given in UTF-16 code units, while the
        // mb_* functions count characters, so walk the text and convert.
        // Not cached statically, so separate decoder instances don't share state.
        $utf16_start = $entity->getOffset() - (int) $this->offset_correction;
        $utf16_end   = $utf16_start + $entity->getLength();

        $offset = 0;
        $length = 0;
        $units  = 0;
        foreach ($chars as $char) {
            if ($units >= $utf16_end) {
                break;
            }
            if ($units < $utf16_start) {
                $offset++;
            } else {
                $length++;
            }
            // Characters outside the BMP (e.g. most Emojis) take 2 code units.
            $units += strlen(mb_convert_encoding($char, 'UTF-16LE', 'UTF-8')) / 2;
        }

        return [$offset, $length];
    }

    /**
     * @param string $style
     * @param string $type
     *
     * @return string
     */
    protected function getFiller(string $style, string $type): string
    {
        $fillers = [
            'html'     => [
                'text_mention' => '<a href="tg://user?id=%2$s">%1$s</a>',
                'text_link'    => '<a href="%2$s">%1$s</a>',
                'bold'         => '<b>%s</b>',
                'italic'       => '<i>%s</i>',
                'code'         => '<code>%s</code>',
                'pre'          => '<pre>%s</pre>',
            ],
            'markdown' => [
                'text_mention' => '[%1$s](tg://user?id=%2$s)',
                'text_link'    => '[%1$s](%2$s)',
                'bold'         => '*%s*',
                'italic'       => '_%s_',
                'code'         => '`%s`',
                'pre'          => '```%s```',
            ],
        ];

        return $fillers[$style][$type] ?? '';
    }

    /**
     * Decode an entity into the passed string.
     *
     * @param MessageEntity $entity
     * @param string        $text
     *
     * @return string
     */
    private function decodeEntity(MessageEntity $entity, string $text): string
    {
        [$offset, $length] = $this->getOffsetAndLength($entity);

        $text_bit = $this->getTextBit($entity, $offset, $length);

        // Replace text bit.
        return mb_substr($text, 0, $offset) . $text_bit . mb_substr($text, $offset + $length);
    }

    /**
     * @param MessageEntity $entity
     * @param int           $offset
     * @param int           $length
     *
     * @return false|string
     */
    private function getTextBit(MessageEntity $entity, $offset, $length)
    {
        $type     = $entity->getType();
        $filler   = $this->getFiller($this->style, $type);
        $text_bit = mb_substr($this->text, $offset, $length);

        switch ($type) {
            case 'text_mention':
                $text_bit = sprintf($filler, $text_bit, $entity->getUser()->getId());
                break;
            case 'text_link':
                $text_bit = sprintf($filler, $text_bit, $entity->getUrl());
                break;
            case 'bold':
            case 'italic':
            case 'code':
            case 'pre':
                $text_bit = sprintf($filler, $text_bit);
                break;
            default:
                break;
        }

        return $text_bit;
    }
}
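The back-to-front replacement used in decode() can be shown in isolation. The sketch below uses plain arrays with made-up entities and an invented function name, `wrapEntities()`, and assumes non-overlapping entities in a text without emojis, so character offsets and UTF-16 offsets coincide.

```php
<?php
// Hypothetical, simplified version of the reverse-order replacement.
function wrapEntities(string $text, array $entities): string
{
    $tags = ['bold' => 'b', 'italic' => 'i'];

    // Work from the last entity to the first: inserting markup only shifts
    // the text behind it, so the offsets of earlier entities stay valid.
    foreach (array_reverse($entities) as $e) {
        $tag  = $tags[$e['type']];
        $bit  = mb_substr($text, $e['offset'], $e['length']);
        $text = mb_substr($text, 0, $e['offset'])
            . "<$tag>$bit</$tag>"
            . mb_substr($text, $e['offset'] + $e['length']);
    }

    return $text;
}

echo wrapEntities('bold and italic', [
    ['type' => 'bold',   'offset' => 0, 'length' => 4],
    ['type' => 'italic', 'offset' => 9, 'length' => 6],
]), "\n";
// prints: <b>bold</b> and <i>italic</i>
```

Processing in forward order would need a running correction like the `$global_incr` and `$globalDiff` counters in the earlier snippets.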

noplanman avatar Dec 12 '19 10:12 noplanman

> My latest experiment, which I'll pack into a small package when it works 100%.

Tested, and I don't see any problems. Lots of emojis and different formatting work OK at first glance.

ParachainsDev avatar Dec 12 '19 11:12 ParachainsDev