Split long messages to avoid being kicked for excess flood
Hi,
My bot sends messages to a channel, and those messages can sometimes be too long for the server to accept.
Would it be possible for irc3 (maybe via an opt-in mechanism) to split long message queries into several ones, to avoid the bot being kicked for excess flood? The message itself could be split on word boundaries.
This could theoretically be handled on the bot's side, but I don't think it would be very robust, as the bot doesn't know exactly what the resulting IRC line will look like (I assume something like `PRIVMSG recipient :message`), and thus cannot know how many characters should be subtracted from the theoretical maximum message size (512, I believe).
HTH.
Hi,
You can set max_length in the config file. Default value is 512: https://github.com/gawel/irc3/blob/master/irc3/base.py#L75
Do servers communicate what maximum message length they accept? In that case, I would like irc3 to detect that, and split messages according to it.
It's possible. But the RFC says 512 is fine: https://github.com/gawel/irc3/blob/2d7d65ca753fdd13b686d0ba5d4c180131ada6f6/irc3/rfc2812.txt#L300
You may be able to track this for your server by checking `bot.config['server_config']`.
See https://github.com/gawel/irc3/blob/2d7d65ca753fdd13b686d0ba5d4c180131ada6f6/irc3/plugins/core.py#L109
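For reference, the 005 (ISUPPORT) tokens end up in that dict roughly like this; a simplified sketch of the parsing, not the exact core.py code:

```python
# Sketch: turn the space-separated tokens of a 005 reply into a dict,
# roughly mirroring what ends up in bot.config['server_config'].
# Flag-only tokens (no '=') are stored as True.
def parse_isupport(tokens):
    config = {}
    for token in tokens.split():
        if '=' in token:
            key, value = token.split('=', 1)
            config[key] = value
        else:
            config[token] = True
    return config

cfg = parse_isupport("SILENCE=25 MODES=6 NICKLEN=30 WHOX")
# cfg['NICKLEN'] == '30', cfg['WHOX'] is True
```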
So, the server I'm connecting to is GameSurge.
Here is the configuration the server sends to the bot:
:NuclearFallout.WA.US.GameSurge.net 005 _nick WHOX WALLCHOPS WALLVOICES USERIP CPRIVMSG CNOTICE SILENCE=25 MODES=6 MAXCHANNELS=75 MAXBANS=100 NICKLEN=30 :are supported by this server
:NuclearFallout.WA.US.GameSurge.net 005 _nick MAXNICKLEN=30 TOPICLEN=300 AWAYLEN=200 KICKLEN=300 CHANNELLEN=200 MAXCHANNELLEN=200 CHANTYPES=#& PREFIX=(ov)@+ STATUSMSG=@+ CHANMODES=b,k,l,imnpstrDdRcC CASEMAPPING=rfc1459 NETWORK=GameSurge :are supported by this server
The bot subsequently prints these values as:
Server config: {'TOPICLEN': '300', 'MAXCHANNELS': '75', 'PREFIX': '(ov)@+', 'MAXCHANNELLEN': '200', 'CHANNELLEN': '200', 'MAXBANS': '100', 'CASEMAPPING': 'rfc1459', 'NETWORK': 'GameSurge', 'USERIP': True, 'WHOX': True, 'CHANMODES': 'b,k,l,imnpstrDdRcC', 'AWAYLEN': '200', 'MODES': '6', 'SILENCE': '25', 'STATUSMSG': '@+', 'MAXNICKLEN': '30', 'KICKLEN': '300', 'WALLVOICES': True, 'NICKLEN': '30', 'CPRIVMSG': True, 'WALLCHOPS': True, 'CHANTYPES': '#&', 'CNOTICE': True}
The 'max_length' option in bot.config is set to 512.
When the bot sends a message that is >512 characters long (I can see that message in the debug log), the server replies:
:NuclearFallout.WA.US.GameSurge.net 417 _nick :Input line was too long
That could mean either:
- `irc3` is not splitting messages to make sure lines remain below 512 characters — if it's not supposed to at the moment, I'd like it to do so
- GameSurge uses a maximum length lower than 512 and doesn't communicate it to `irc3` to split messages — I haven't tried to verify that; it's possible that GameSurge uses 512 as the limit
HTH.
As I remember, you can track what the bot receives/sends by using `irc3 -vdr config.ini`
After debugging the issue, I found two problems, and one suspicious thing:

- the `IrcBot.privmsg()` function assumes that messages (i.e. words sent by users) must be `self.config.max_length` in length, when that number applies to the whole IRC line (i.e. `PRIVMSG target :message\r\n`) — other parts of the code that deal with the protocol might have this issue as well, I didn't check
  https://github.com/gawel/irc3/blob/2d7d65ca753fdd13b686d0ba5d4c180131ada6f6/irc3/rfc2812.txt#L272-L274
  https://github.com/gawel/irc3/blob/2d7d65ca753fdd13b686d0ba5d4c180131ada6f6/irc3/rfc2812.txt#L300-L304
- the length of the IRC line is supposed to be counted in bytes (ASCII or UTF-8 encoded), but `utils.split_message()` works on unicode strings, which can throw off the count — I couldn't really find anything in the RFC about that after a cursory look; maybe some servers (other than GameSurge) are lenient and count codepoints?
- the bot sends messages one after the other without any form of rate limiting, which results in it being kicked for excess flood once the above problems are solved — I can see a `flood_delay` local configuration option, but from what I can tell the server doesn't seem to indicate what minimum delay should separate messages that it could be compared to
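For the third point, the kind of behavior I have in mind would be something like this sketch; the `drain_queue` name is hypothetical and the delay value is an assumption, since the server doesn't advertise one:

```python
import asyncio

# Sketch: send queued lines one at a time, sleeping flood_delay seconds
# between them, which is roughly what a flood_delay-style option would do.
async def drain_queue(queue, send, flood_delay=0.5):
    while not queue.empty():
        line = await queue.get()
        send(line)
        await asyncio.sleep(flood_delay)
```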
Here is a debugging patch that I used to diagnose the above issues:
diff --git a/irc3/__init__.py b/irc3/__init__.py
index 2cd5fe7..12ee2c9 100644
--- a/irc3/__init__.py
+++ b/irc3/__init__.py
@@ -179,6 +179,7 @@ class IrcBot(base.IrcObject):
def send_line(self, data, nowait=False):
"""send a line to the server. replace CR by spaces"""
+ self.log.debug("DBG: (%d) %s", len(data), data)
data = data.replace('\n', ' ').replace('\r', ' ')
f = asyncio.Future(loop=self.loop)
if self.queue is not None and nowait is False:
@@ -225,15 +226,23 @@ class IrcBot(base.IrcObject):
def privmsg(self, target, message, nowait=False):
"""send a privmsg to target"""
if message:
- messages = utils.split_message(message, self.config.max_length)
if isinstance(target, DCCChat):
+ messages = utils.split_message(message, self.config.max_length)
for message in messages:
target.send_line(message)
elif target:
+ command_prefix = 'PRIVMSG %s :' % target
+ self.log.debug("MAX_LENGTH: %d" % self.config.max_length)
+ assert self.config.max_length > len(command_prefix) + 2
+ self.log.debug("SPLIT: %s" % message)
+ messages = utils.split_message(message, self.config.max_length - len(command_prefix) - 2, self.log)
+ self.log.debug("MESSAGES: %s", messages)
f = None
for message in messages:
- f = self.send_line('PRIVMSG %s :%s' % (target, message),
+ self.log.debug("DBG: MSG (%d) %s", len(message), message)
+ f = self.send_line('%s%s' % (command_prefix, message),
nowait=nowait)
+ break
return f
def action(self, target, message, nowait=False):
diff --git a/irc3/utils.py b/irc3/utils.py
index 4f158f8..fb91620 100644
--- a/irc3/utils.py
+++ b/irc3/utils.py
@@ -164,8 +164,38 @@ class IrcString(BaseString):
STRIPPED_CHARS = '\t '
-def split_message(message, max_length):
+def split_message(message, max_length, log=None):
"""Split long messages"""
+ def utf8_lead_byte(b):
+ '''A UTF-8 intermediate byte starts with the bits 10xxxxxx.'''
+ return (b & 0xC0) != 0x80
+
+ def utf8_byte_truncate(text, max_bytes):
+ '''If text[max_bytes] is not a lead byte, back up until a lead byte is
+ found and truncate before that character.'''
+ log.debug("IN")
+ utf8 = text.encode('utf8')
+ log.debug("utf8: %s", utf8)
+ if len(utf8) <= max_bytes:
+ log.debug("SPLIT: max bytes")
+ return utf8
+ log.debug("max_bytes: %d", max_bytes)
+ i = max_bytes
+ while i > 0 and not utf8_lead_byte(utf8[i]):
+ log.debug("I: %d", i)
+ i -= 1
+ log.debug("RET")
+ return utf8[:i]
+
+ if log is not None:
+ log.debug("SPLIT MSG: %d - %s", max_length, message)
+ s = utf8_byte_truncate(message, max_length)
+ log.debug("S: %s", s)
+ s = s.decode('utf8')
+ log.debug("DECODED")
+ yield s
+ return
+
if len(message) > max_length:
for message in textwrap.wrap(message, max_length):
yield message
I used this implementation to split strings according to byte length: https://stackoverflow.com/a/43848928
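Stripped of the debug logging, the truncation helper from that answer boils down to something like:

```python
def utf8_byte_truncate(text, max_bytes):
    # Truncate text so its UTF-8 encoding fits in max_bytes without
    # cutting inside a multi-byte sequence; UTF-8 continuation bytes
    # match the bit pattern 10xxxxxx.
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return text
    i = max_bytes
    while i > 0 and (encoded[i] & 0xC0) == 0x80:
        i -= 1
    return encoded[:i].decode('utf-8')

# '€' is 3 bytes in UTF-8, so a 4-byte budget keeps only one of two
print(utf8_byte_truncate('€€', 4))  # €
```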
HTH.
Thanks for digging into that. I don't like the idea of encoding/decoding the whole messages to... re-encode them before sending to the server. encoding/decoding are very CPU expensive AFAIK.
Any chance we'll get a fix soon? My bot trips on half of the URLs (apparently the internet has decided that having 1k+ character sentences in the metadata is a good idea).
I understand the concern about the encoding round trips, however there doesn't seem to be any alternative that I can see.
I'll be ok with that if there is no need to re-decode the data. I'm pretty sure a few `if isinstance(data, bytes)` checks will be faster than decoding/encoding the whole messages multiple times. Don't know how hard it is to add those `if`s and how many of them are required.
I'll try to get to it soon...
For now I've added a `prefix` argument to `split_message` (inspired by your snippet). This should reduce your amount of errors. Maybe it's enough to "fix" the problem if you set `max_length` to an appropriate value to leave room for some extra unicode characters. Maybe 500 instead of 512.
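A prefix-aware split might look something like the following; this is a sketch, and the actual signature of irc3's `utils.split_message` may differ. Note it still counts characters rather than bytes, so the unicode issue above remains:

```python
import textwrap

# Hypothetical sketch: subtract the prefix length (e.g. 'PRIVMSG #chan :')
# from the budget before wrapping on word boundaries, reserving 2 more
# characters for the trailing CRLF.
def split_message(message, max_length, prefix=''):
    budget = max_length - len(prefix) - 2
    for chunk in textwrap.wrap(message, budget):
        yield chunk

parts = list(split_message('word ' * 100, 120, prefix='PRIVMSG #chan :'))
# every chunk now fits in the line alongside its command prefix
```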
Looks like supporting both unicode and bytes in the whole process will not be easy at all.