pgcontents
pgcontents copied to clipboard
Improving upserts
In my setup I ran into problems with updates of notebooks. Because the insert failed, the transaction ended up in a weird state. As a consequence the update was not executed. This PR changes the code to first check whether a file exists or not. If it does not exist, an insert is performed. If the file does exist an update query is executed.
Also added a paragraph to the README, to explain why some tables can become "invisible" when inspecting the database, due to the table being in a different namespace.
I was under the assumption that the select and insert/update would happen in a single transaction, because of the db.begin_nested() in the with statement.
This is true. These will both happen in a single transaction, however,
Hence, if two transaction are trying to save (a new notebook) simultaneously, one would win, the other would need to retry a save and get updated in the second attempt.
Having these happen in separate transactions doesn't prevent the race I'm concerned about. Suppose that there are two writers, A and B, that are each trying to write a new notebook to the same location. The following sequence of events is possible:
- Writer A calls
_file_exists, which issues a query to the database checking if the desired file already exists. - Concurrently, writer B also calls
_file_exists. - Writer A's call finishes, reporting that the file doesn't exist.
- Writer B's call finishes, reporting that the file doesn't exist.
- Both writers try to write the file. One of them will succeed, and the other will get a unique key violation, which will now be unhandled (this is what we were trying to handle previously by catching
IntegrityError.
This is a pretty common race for insert-or-update operations, and the usual way to fix it (without having dedicated "upsert" operations) to try the operation and handle the error if it occurs, which is what this code was trying to do previously. I'm not sure what's going wrong though. The error suggests that we might need to exit the begin_nested block (which corresponds to a SAVEPOINT/ROLLBACK pair), but I think it'd probably be better to use the proper upsert operation now that it exists in pretty much all reasonable versions of postgres.
On the upsert: very interesting. My initial attempt at using the on_conflict_do_update fails with Insert does not have an attribute on_conflict_do_update. This is the code fragment:
I think the issue here is that on_conflict_do_update is a postgres-specific operation. Sqlalchemy puts database-specific functionality in special "dialect" modules so that you know you're doing something that might not be portable across databases. Instead of doing files.insert, I think we need to do:
# This creates an `Insert` object that supports postgres-specific functionality.
from sqlalchemy.dialects.postgresql import insert
db.execute(insert(files).values(...).on_conflict_do_update(...)
Thanks for the suggestion. I got the upsert working with SQLAlchemy. What do you think?
@jorisgillis these changes look good to me! Unfortunately, it appears that I no longer have permission to merge PRs to this repo. Quantopian, the company for whom I originally wrote this plugin, no longer exists, so unfortunately many of their open source repos are in a bit of a state of flux. I'll see if I can contact someone about getting permissions back on this repo