sopel
sopel copied to clipboard
Race Condition With Database Writes (`set_nick_value()`)
Description
If two commands try to write a nick value (set_nick_value()
) for the a nickname at the same time, before that nickname exists in the database, you will end up with multiple nick_ids
for one nicknames
.
Reproduction steps
- Install Sopel (doesn't matter if stable or from git).
- Create a basic enough
default.cfg
so that Sopel will connect to your desired server and channel. - Add
test.py
(see below) to your/plugins
subfolder - Start Sopel
- Type any message into chat and you will either get an error in chat or in your console (or both!).
- Investigate your database and you will see in the
nicknames
table that one nick has twonick_id
s associated with it.
test.py
:
from sopel import plugin
@plugin.rule(r".*")
def test_break(bot, trigger):
bot.db.set_nick_value(trigger.nick, 'test_value', 0)
Expected behavior
There should only be one nick_id created for one nickname.
Relevant logs
[2023-06-19 16:26:44,218] sopel.bot ERROR - Unexpected MultipleResultsFound (Multiple rows were found when one or none was required) from test at 2023-06-19 21:26:44.218692+00:00. Message was: test
Traceback (most recent call last):
File "/home/test/bots/testbot/lib/python3.10/site-packages/sopel/bot.py", line 703, in call_rule
rule.execute(sopel, trigger)
File "/home/test/bots/testbot/lib/python3.10/site-packages/sopel/plugins/rules.py", line 1267, in execute
exit_code = self._handler(bot, trigger)
File "/home/test/bots/testbot/sopel/plugins/test.py", line 6, in test_break
bot.db.set_nick_value(trigger.nick, 'test_value', 0)
File "/home/test/bots/testbot/lib/python3.10/site-packages/sopel/db.py", line 466, in set_nick_value
nick_id = self.get_nick_id(nick, create=True)
File "/home/test/bots/testbot/lib/python3.10/site-packages/sopel/db.py", line 378, in get_nick_id
).scalar_one_or_none()
File "/home/test/bots/testbot/lib/python3.10/site-packages/sqlalchemy/engine/result.py", line 1225, in scalar_one_or_none
return self._only_one_row(
File "/home/test/bots/testbot/lib/python3.10/site-packages/sqlalchemy/engine/result.py", line 614, in _only_one_row
raise exc.MultipleResultsFound(
sqlalchemy.exc.MultipleResultsFound: Multiple rows were found when one or none was required
[2023-06-19 16:26:44,219] sopel.modules.seen ERROR - Unable to save seen, database error: Multiple rows were found when one or none was required
Notes
Make sure you're testing with with a fresh database or using nicks your bot has never seen before.
Sopel version
v8.0.0.dev0
Installation method
pip install
Python version
3.10.12
Operating system
No response
IRCd
No response
Relevant plugins
seen.py
I'm not able to repro this with the latest master and python 3.10.6
Ah. After completely blowing away the database, I can repro this.
~~@RustyBower: you're testing either with a fresh database or a nick your bot's never seen before?~~
He is now. :P
Might be reasonable for "this is a new nick I haven't seen before" to be handled somewhere in core before plugins ever have a chance to trigger, maybe a @plugin.priority('high')
+ @plugin.unblockable
handler in coretasks
. There's already a JOIN
handler that would be a natural place to do this, and it's already intended to run before any plugins get their mitts on the trigger
.
(Yes, it'd still be possible for plugins to intentionally attempt to have their code run before coretasks
—we even have a documentation tip about handling events before Sopel does. But making it intentional, not accidental, would be a big step all the same.)
I can reproduce this on both 3.10 and 3.7. This is a TOCTOU issue in set_nick_value()
.
A user plugin should be able to avoid this issue in 7.1.9 or against development (7bcb942) by adding @plugin.thread(False)
to their handlers, which forces the transaction onto the same thread (main) as the seen
plugin. @wkeiuluf this may be useful to you as an immediate workaround, but the bug is I think a legitimate issue in Sopel.
maybe a
@plugin.priority('high')
+@plugin.unblockable
handler in coretasks
I'm a -1 on this since it means doing work in the core that may be serving a plugin that isn't even enabled. I imagine most Sopel instances do enable seen
, but it feels too much like coupling between a plugin and the core. FWIW, I tried altering the seen
plugin to run with "high"
priority and it did not affect the behavior of the reproduction, so I'm not even sure if this approach will work, philosophical matters aside.
If we declare the slug
column to be unique=True
, we can at least force an IntegrityError
in this collision case instead writing bad data to the DB, but I'm not sure what the rest of a patch that does this would look like, it wouldn't be very polite of us to issue an error to the user in this case, but I'm not sure that just retrying the transaction again is a good idea either. Still messing around with it…
I'm a -1 on this since it means doing work in the core that may be serving a plugin that isn't even enabled. I imagine most Sopel instances do enable
seen
, but it feels too much like coupling between a plugin and the core.
Think I was too brief before. I don't mean "a nick I haven't seen before" in the context of seen
the plugin, but rather a nick that doesn't exist in the database yet. Sopel already adds to bot.users
(and the lists for each channel) when it sees a JOIN
, and it wouldn't be unreasonable to ping the database with (spitballing here) bot.db.get_nick_id(new_nick, create=true)
so the database also is "aware of" the new nick, just as the bot object in memory is.
Maybe it would eliminate this particular race/TOCTOU, or maybe not—but it also seems unnecessarily laissez-faire to wait until someone calls bot.db.set_nick_value()
to generate an ID. On the other hand, we don't want the DB to fill with useless rows for every temporary nickname that joins after a netsplit… Handling the issue directly does seem like the smart move, long term.
(All that said, the database schema really needs a revision to combine the nick_ids
and nicknames
tables. The way it's implemented now is silly, and we know it—but I think it's a holdover from some quirk of SQLite autoincrement or something… not sure exactly, it predates my involvement. Making a schema change without first building out a migration path just seems "too YOLO", though.)
If we declare the
slug
column to beunique=True
, we can at least force anIntegrityError
in this collision case instead writing bad data to the DB, but I'm not sure what the rest of a patch that does this would look like
I'm a little more sure now what this solution would look like. Making the slug
unique causes the IntegrityError
to occur earlier than I had realized, when calling get_nick_id()
which mutates the DB (yuck). The patch below tries to detect this condition and backs off the attempted creation of a new entry in the Nicknames
table, and I can't reproduce the bug with this patch applied.
Click for patch diff
diff --git a/sopel/db.py b/sopel/db.py
index 2763495f..191d6d1c 100644
--- a/sopel/db.py
+++ b/sopel/db.py
@@ -24,7 +24,7 @@ import typing
from sqlalchemy import Column, create_engine, ForeignKey, Integer, String
from sqlalchemy.engine.url import make_url, URL
-from sqlalchemy.exc import OperationalError
+from sqlalchemy.exc import OperationalError, IntegrityError
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker
from sqlalchemy.sql import delete, func, select, update
@@ -68,7 +68,7 @@ class Nicknames(BASE):
__tablename__ = 'nicknames'
__table_args__ = MYSQL_TABLE_ARGS
nick_id = Column(Integer, ForeignKey('nick_ids.nick_id'), primary_key=True)
- slug = Column(String(255), primary_key=True)
+ slug = Column(String(255), primary_key=True, unique=True)
canonical = Column(String(255))
@@ -397,14 +397,26 @@ class SopelDB:
session.add(nick_id)
session.commit()
- # Create a new Nickname
- nickname = Nicknames(
- nick_id=nick_id.nick_id,
- slug=slug,
- canonical=nick,
- )
- session.add(nickname)
- session.commit()
+ try:
+ # Create a new Nickname
+ nickname = Nicknames(
+ nick_id=nick_id.nick_id,
+ slug=slug,
+ canonical=nick,
+ )
+ session.add(nickname)
+ session.commit()
+ except IntegrityError as exc:
+ if exc.args and "UNIQUE constraint failed: nicknames.slug" in exc.args[0]:
+ LOGGER.warning("Data race detected when calling get_nick_id()")
+ # another thread beat us to creating this entry, rollback and query again
+ session.rollback()
+ nickname = session.execute(
+ select(Nicknames).where(Nicknames.slug == slug)
+ ).scalar_one_or_none()
+ else:
+ # something else went wrong, re-raise
+ raise
However, this patch is incomplete, because a new nick_id
is still generated by the earlier transaction in get_nick_id()
. Below are the contents of my test bot's DB after a test with this patch applied, note that there are only two nicknames, but four nick_ids:
sqlite> SELECT * FROM nicknames;
2|snoopj|SnoopJ
3|terribot|terribot
sqlite> SELECT * FROM nick_ids;
1
2
3
4
It sounds like either the core task or DB revision that @dgw proposes would avoid this problem, with the core task neatly avoiding this issue altogether. I would be in favor of promptly populating the table before any plugins (including official ones) have the option to cause havoc, but I'm not familiar enough with these Sopel guts to know what that looks like.
I don't want to further slow 8.0.0
down, but could we add this to the 8.1.0
milestone?
Since get_nick_id()
is the problematic code path, I'm sure we could cut the frequency of this issue down to (almost?) nothing using a Lock
object and a well-placed with nick_id_lock:
context manager.
And in the future, once we sort out the path toward proper database migrations (#1784), some simplified version of the ridiculous three-table schema we have for nicknames can be applied to channels, so they too can benefit from case-insensitivity. But I'm getting way ahead of this issue, huh?