irc3 Split long messages to avoid being kicked for excess flood

Hi,

My bot sends messages to a channel whose length can sometimes be too long for the server to accept.

Would it be possible for irc3 (maybe via an opt-in mechanism) to split long message queries into several ones, to avoid the bot being kicked for excess flood? The message itself could be split on word boundaries.

This could theoretically be handled on the bot's side, but I don't think it would be very robust, as the bot doesn't know exactly what the resulting IRC query will look like (I assume something like PRIVMSG recipient :message), and thus not be able to know how many characters should be subtracted to the theoretical maximum message size (512 I believe).

HTH.

Apr 15 '21 06:04 lenormf

Hi,

You can set max_length in the config file. Default value is 512: https://github.com/gawel/irc3/blob/master/irc3/base.py#L75

Apr 15 '21 06:04 gawel

Do servers communicate what maximum message length they accept? In that case, I would like irc3 to detect that, and split messages according to it.

Apr 15 '21 06:04 lenormf

It's possible. But the rfc say 512 is fine: https://github.com/gawel/irc3/blob/2d7d65ca753fdd13b686d0ba5d4c180131ada6f6/irc3/rfc2812.txt#L300

You may be able to track this for your server by checking bot.config['server_config']

See https://github.com/gawel/irc3/blob/2d7d65ca753fdd13b686d0ba5d4c180131ada6f6/irc3/plugins/core.py#L109

Apr 15 '21 07:04 gawel

So, the server I'm connecting to is GameSurge.

Here is the configuration the server sends to the bot:

:NuclearFallout.WA.US.GameSurge.net 005 _nick WHOX WALLCHOPS WALLVOICES USERIP CPRIVMSG CNOTICE SILENCE=25 MODES=6 MAXCHANNELS=75 MAXBANS=100 NICKLEN=30 :are supported by this server
:NuclearFallout.WA.US.GameSurge.net 005 _nick MAXNICKLEN=30 TOPICLEN=300 AWAYLEN=200 KICKLEN=300 CHANNELLEN=200 MAXCHANNELLEN=200 CHANTYPES=#& PREFIX=(ov)@+ STATUSMSG=@+ CHANMODES=b,k,l,imnpstrDdRcC CASEMAPPING=rfc1459 NETWORK=GameSurge :are supported by this server

The bot subsequently prints these values as:

Server config: {'TOPICLEN': '300', 'MAXCHANNELS': '75', 'PREFIX': '(ov)@+', 'MAXCHANNELLEN': '200', 'CHANNELLEN': '200', 'MAXBANS': '100', 'CASEMAPPING': 'rfc1459', 'NETWORK': 'GameSurge', 'USERIP': True, 'WHOX': True, 'CHANMODES': 'b,k,l,imnpstrDdRcC', 'AWAYLEN': '200', 'MODES': '6', 'SILENCE': '25', 'STATUSMSG': '@+', 'MAXNICKLEN': '30', 'KICKLEN': '300', 'WALLVOICES': True, 'NICKLEN': '30', 'CPRIVMSG': True, 'WALLCHOPS': True, 'CHANTYPES': '#&', 'CNOTICE': True}

The 'max_length' option in bot.config is set to 512.

When the bot sends a message that is >512 characters long (I can see that message in the debug log), the server replies:

:NuclearFallout.WA.US.GameSurge.net 417 _nick :Input line was too long

That could mean that either:

irc3 is not splitting messages to make sure packets remain below 512 characters — if it's not supposed to at the moment, I'd like it to do so
GameSurge uses a maximum length lesser than 512 and doesn't communicate it to irc3 to split messages — I haven't tried to verify that, it's possible that GameSurge uses 512 as limit

HTH.

Apr 15 '21 07:04 lenormf

As I remember you can track what the bot receive/send by using irc3 -vdr config.ini

Apr 15 '21 08:04 gawel

After debugging the issue, I found two problems, and one suspicious thing:

the IrcBot.privmsg() function assumes that messages (i.e. words sent by users) must be self.config.max_length in length, when that number applies to the whole IRC message (i.e. PRIVMSG target :message\r\n) — other parts of the code that deal with protocol might have this issue as well, I didn't check

https://github.com/gawel/irc3/blob/2d7d65ca753fdd13b686d0ba5d4c180131ada6f6/irc3/rfc2812.txt#L272-L274

https://github.com/gawel/irc3/blob/2d7d65ca753fdd13b686d0ba5d4c180131ada6f6/irc3/rfc2812.txt#L300-L304

the length of the IRC message is supposed to be counted in bytes (ASCII or encoded UTF8), but utils.split_messages() works on unicode strings, which can throw off the count — I couldn't really find anything in the RFC about that after a cursory look, maybe some servers (other than GameSurge) are lenient and count codepoints?
the bot sends messages one after the other without any form of rate limiting, which results in it being kicked for excess flood, once the above problems are solved — I can see a flood_delay local configuration option, but from what I can tell the server doesn't seem to indicate what minimum delay should separate messages that it could be compared to

Here is a debugging patch that I used to diagnose the above issues:

diff --git a/irc3/__init__.py b/irc3/__init__.py
index 2cd5fe7..12ee2c9 100644
--- a/irc3/__init__.py
+++ b/irc3/__init__.py
@@ -179,6 +179,7 @@ class IrcBot(base.IrcObject):
 
     def send_line(self, data, nowait=False):
         """send a line to the server. replace CR by spaces"""
+        self.log.debug("DBG: (%d) %s", len(data), data)
         data = data.replace('\n', ' ').replace('\r', ' ')
         f = asyncio.Future(loop=self.loop)
         if self.queue is not None and nowait is False:
@@ -225,15 +226,23 @@ class IrcBot(base.IrcObject):
     def privmsg(self, target, message, nowait=False):
         """send a privmsg to target"""
         if message:
-            messages = utils.split_message(message, self.config.max_length)
             if isinstance(target, DCCChat):
+                messages = utils.split_message(message, self.config.max_length)
                 for message in messages:
                     target.send_line(message)
             elif target:
+                command_prefix = 'PRIVMSG %s :' % target
+                self.log.debug("MAX_LENGTH: %d" % self.config.max_length)
+                assert self.config.max_length > len(command_prefix) + 2
+                self.log.debug("SPLIT: %s" % message)
+                messages = utils.split_message(message, self.config.max_length - len(command_prefix) - 2, self.log)
+                self.log.debug("MESSAGES: %s", messages)
                 f = None
                 for message in messages:
-                    f = self.send_line('PRIVMSG %s :%s' % (target, message),
+                    self.log.debug("DBG: MSG (%d) %s", len(message), message)
+                    f = self.send_line('%s%s' % (command_prefix, message),
                                        nowait=nowait)
+                    break
                 return f
 
     def action(self, target, message, nowait=False):
diff --git a/irc3/utils.py b/irc3/utils.py
index 4f158f8..fb91620 100644
--- a/irc3/utils.py
+++ b/irc3/utils.py
@@ -164,8 +164,38 @@ class IrcString(BaseString):
 STRIPPED_CHARS = '\t '
 
 
-def split_message(message, max_length):
+def split_message(message, max_length, log=None):
     """Split long messages"""
+    def utf8_lead_byte(b):
+        '''A UTF-8 intermediate byte starts with the bits 10xxxxxx.'''
+        return (b & 0xC0) != 0x80
+
+    def utf8_byte_truncate(text, max_bytes):
+        '''If text[max_bytes] is not a lead byte, back up until a lead byte is
+        found and truncate before that character.'''
+        log.debug("IN")
+        utf8 = text.encode('utf8')
+        log.debug("utf8: %s", utf8)
+        if len(utf8) <= max_bytes:
+            log.debug("SPLIT: max bytes")
+            return utf8
+        log.debug("max_bytes: %d", max_bytes)
+        i = max_bytes
+        while i > 0 and not utf8_lead_byte(utf8[i]):
+            log.debug("I: %d", i)
+            i -= 1
+        log.debug("RET")
+        return utf8[:i]
+
+    if log is not None:
+        log.debug("SPLIT MSG: %d - %s", max_length, message)
+        s = utf8_byte_truncate(message, max_length)
+        log.debug("S: %s", s)
+        s = s.decode('utf8')
+        log.debug("DECODED")
+        yield s
+        return
+
     if len(message) > max_length:
         for message in textwrap.wrap(message, max_length):
             yield message

I used this implementation to split strings according to byte length: https://stackoverflow.com/a/43848928

HTH.

Apr 16 '21 07:04 lenormf

Thanks for digging into that. I don't like the idea of encoding/decoding the whole messages to... re-encode them before sending to the server. encoding/decoding are very CPU expensive AFAIK.

Apr 21 '21 12:04 gawel

Any chance we'll get a fix soon? My bot trips on half of the URLs (apparently the internet has decided that having 1k+ characters long sentences in the metadata is a good idea).

I understand the concern about the encoding round trips, however there doesn't seem to be any alternative that I can see.

May 14 '21 04:05 lenormf

I'll be ok with that if there is no need to re-decode the data. I'm pretty sure a few if isinstance(data, bytes) will be faster than decoding/encoding the whole messages multiple times. Don't know how hard it is to add those if and how many of them are required.

I'll try to give it a try soon...

May 14 '21 08:05 gawel

For now I've added a prefix argument to split_message (inspired by your snippet). This should reduce your amount of errors. Maybe it's enough to "fix" the problem if you set max_length to an appropriate value to allow some extra unicode characters.. Maybe 500 instead of 512.

Looks like supporting both unicode and bytes in the whole process will not be easy at all..

May 14 '21 10:05 gawel