Is ON CONFLICT DO UPDATE; possible?
Was reading up on how ON CONFLICT DO NOTHING; works in the PostgreSQL docs (which is a great addition), and saw that ON CONFLICT DO UPDATE; is also an option, which is effectively an "upsert."
I wonder if that's also useful in a bulk loading context?
Has Django tackled this yet?
Not sure if at the database level, but it's essentially what update_or_create() does.
I don't think ON CONFLICT DO UPDATE; by itself is a thing. ON CONFLICT DO UPDATE SET YOUR_UPDATE_LOGIC; is, though. (As far as I know, at least)
Which I find a bit conflicting since the amount of possible logic combinations becomes very large; now the user can choose any combination of expressions, default values and sub-queries per every column to be updated, plus a where condition to all that logic.
I definitely think it's a great addition. I just wonder if we'd be better off simply leaving an empty slot to which the user can add its own update logic string, which is an idea I also dislike.
I would love to see this feature. We import about 11 million records a day and COPY is by far the fastest.
Dango Postgres Extras currently handles upserts.
https://django-postgres-extra.readthedocs.io/manager/#upserting
I like the way Django Postgres Extras handles it, but I believe copying that functionality would require some mayor rewriting of how Django Postgres Copy works.
A very rudimentary solution I propose is passing a list with the following order:
- Operation (
DO NOTHINGorDO UPDATE) - Constraint (the name of the column with a constraint)
- Columns to update (in the case of
DO UPDATE)
I replaced ignore_conflicts with on_conflict and defaults to an empty list:
def __init__(
...
on_conflict=[],
...
):
def insert_suffix(self):
# If on_conflict is not an empty list
if self.on_conflict:
# First item on list - Operation
if self.on_conflict[0] == 'DO NOTHING':
return """
ON CONFLICT DO NOTHING;
"""
elif self.on_conflict[0] == 'DO UPDATE':
# Second item on list - Constraint
constraint = self.on_conflict[1]
# Delete first two items on list. Only columns to be updated remain
del self.on_conflict[0:2]
update_columns =
', '.join(["{0} = EXCLUDED.{0}".format(col) for col in self.on_conflict])
return """
ON CONFLICT ({0}) DO UPDATE SET {1};
""".format(constraint, update_columns)
else:
return ";"
I tested it out and it seems to work well. I understand it's not the most pythonic solution though
an example would be:
Customers.objects.from_csv('/path/to/file.csv', on_conflict=['DO UPDATE', 'name', 'email'])
with multiple fields to update:
Customers.objects.from_csv('/path/to/file.csv', on_conflict=['DO UPDATE', 'name', 'email', 'active', 'age'])
or in plain english:
if field on column 'name' already exists, update 'email', 'active' and 'age' fields
Thanks a ton of the snippet. We forked the project and got everything going.
We also added temp_table_name_suffix because this will be used in Airflow with Celery, and also added static_mapping_on_conflict to meet our needs.
Wow. I'm impressed. You guys are flying by my understanding of ON CONFLICT. I'm open to finding a way to include it in this library, but I'm reluctant to allow raw SQL to be passed into Python keyword arguments.
If it is possible, I'd prefer to adopt a Pythonic system using Django idioms or proven techniques from other Python libraries like psycopg2. If something like that exists and you guys think it will work, let me know.
If not, perhaps in the meantime we could extend the private functions inside this library so that ones that tack together the raw SQL can be extended by creative users such as yourself.
Here's a look at how django-postgres-extra handles their INSERT and UPDATE statements. They subclass from Django's SQLCompiler and its derivatives. I have given the SQLCompiler a few reads and I can't quite comprehend how it works. I understand its purpose, I just seem to get lost in the code.
It seems that extending Django's functionality is the best practice, however django-postgres-copy went a different route from the beginning and it would require a major rewrite of the library to subclass from the SQLCompiler.
I'm definitely not an expert in the matter. The above is simply my impression after trying to figure this out.
Any chance this is on the roadmap? Looks like it would be a great addition to the lib.
I'm putting together a PR that addresses this as well as #122; I believe my approach is sufficient to avoid concerns with raw SQL being passed as args/kwargs. I've got the core of the code working well although I need to add tests (admittedly, not my strongest suit).
@palewire Since 2.7.2 requires Django 4, does that mean you're okay with Python 3.8 coding, e.g. walrus operator, or would you prefer those left out?
Hoping to submit the PR by Sunday or Monday.
This feature would be great for my use case too : keeping imported data up-to-date with the source (daily synchronization).
I would not use a on_conflict argument but just a update_key argument taking the column (or tuple/list of columns) that are part of the unique constraint to use as the update arbiter. We could also just accept True as an argument, in this case the unique constraint would be inferred from Django model (an exception would be raised if there are several unique constraints).
An optional update_columns parameter could take the list of columns to be updated, the default being all columns except static ones.
Or maybe a simple update argument that could accept True (because update_key=True does not make much sense to me).
update_key and update_columns (both taking columns) would then be optional arguments when update=True is specified.
What do you think?
@ipamo What about a simple on_conflict param that defaults to None, but accepts either 'update' or 'ignore' (None would proceed without any ON CONFLICT logic, and 'ignore' would replace the current ignore_conflicts param)?
I designed my original PR to mirror the actual logic of the ON CONFLICT as close as possible (hence passing a dictionary as on_conflict with keys for target and action), but I can see the benefit of simplifying things a bit and detecting any constraints instead of forcing the developer to name them specifically.
@palewire Any thoughts?
Using on_conflict as you suggest has the benefit of having a single argument for both cases (ignore and update), which is great, but :
- This will make the existing
ignore_conflictsobsolete. - It is necessary to specify labels as argument values ('update' or 'ignore'). Just using booleans seems more easier/pythonic.
- I'm not sure all users would make a relation between the idea of a conflict and the update. I see the idea of the conflict as Postgres INSERT query "internals" (probably also because of my MSSQL backgrounds where upserts are done with a MERGE command without any idea of a conflict). To me, just using
update=Trueis much better semantically and also easier to remember.
The last point is obviously a quite personal reason but I prefer to state it as others might have the same impression.
Also, I think we need an easy way to specify the key (if there are several unique constraints) and the updated columns. Hence update_key and update_columns (which we could specify as implying update=True).
I also like the idea of static_mapping_on_conflict which I saw above. I would name it static_mapping_on_update (because of the above reasons + it is not used at all in the "ON CONFLICT DO NOTHING" case).
Regarding static_mapping we could also find a way to avoid specifying a second parameter ? That could be good because for many static columns we may want to do the same thing, and we would not easily know/forget whether we have to put the same values in both parameters.
Regarding
static_mappingwe could also find a way to avoid specifying a second parameter ? That could be good because for many static columns we may want to do the same thing, and we would not easily know/forget whether we have to put the same values in both parameters.
Idea: using suffix in dict key":~ for updates only.
Use case 1:
{
"created": datetime.now(),
"updated~": datetime.now(),
}
createdwould be set only at insertionupdatedwould be None at insertion
Use case 2:
{
"mycolumn": "X",
}
- Compatible with existing specification: column is inserted with the given value.
- Static columns would not be included in the updated columns by default (so no change on updates).
- If
update_columnsincludemycolumn, then it is updated with the given static value.
Use case 3:
{
"mycolumn": "X",
"mycolumn~": "Y",
}
- Column set to X at insertion and Y at update.
Idea: using suffix in dict key:~ for updates only.
Well I realize it's probably too complex. static_mapping_on_update is probably better?
We would have to document clearly that:
- Columns in
static_mappingare not updated if they are not instatic_mapping_on_updateor inupdate_columns. - Columns in
static_mapping_on_updateare inserted with null/default value if they are not instatic_mapping.
Let me know what you think.
I think your concepts dealing with static_mapping might be a bit too complicated (EDIT: but perhaps I just need to review them further, will look into it more when I get a chance).
I feel like using an on_conflict argument might be best this is, after all, a postgres-specific package so it seems appropriate to utilize postgres-specific language even though other databases might use e.g.,"merge" or "update/upsert".
Furthermore, while I understand your hesitation of using labels ('update' or 'ignore'), if we used e.g. booleans, then it would be possible to have update=True, ignore=True, which would be unclear, as opposed to just raising a ValueError if on_conflict is not one of None, ignore, or update.
I understand your points too, even if I stay with my personal preference ;)
How would you deal with specifying the update key / columns ?
Well, my PR #156 uses a dictionary such as
on_conflict = {
'action': 'update',
'target': <column name or constraint name>,
'columns': ['update_col_1', 'update_col_2'],
}
This follows the convention for the ON CONFLICT cause outlined here: https://www.postgresql.org/docs/current/sql-insert.html