django-postgres-copy Is ON CONFLICT DO UPDATE; possible?

Was reading up on how ON CONFLICT DO NOTHING; works in the PostgreSQL docs (which is a great addition), and saw that ON CONFLICT DO UPDATE; is also an option, which is effectively an "upsert."

I wonder if that's also useful in a bulk loading context?

Aug 22 '18 21:08 rdmurphy

Has Django tackled this yet?

Aug 22 '18 21:08 palewire

Not sure if at the database level, but it's essentially what update_or_create() does.

Aug 22 '18 21:08 rdmurphy

I don't think ON CONFLICT DO UPDATE; by itself is a thing. ON CONFLICT DO UPDATE SET YOUR_UPDATE_LOGIC; is, though. (As far as I know, at least)

Which I find a bit conflicting since the amount of possible logic combinations becomes very large; now the user can choose any combination of expressions, default values and sub-queries per every column to be updated, plus a where condition to all that logic.

I definitely think it's a great addition. I just wonder if we'd be better off simply leaving an empty slot to which the user can add its own update logic string, which is an idea I also dislike.

Aug 23 '18 04:08 igncampa

I would love to see this feature. We import about 11 million records a day and COPY is by far the fastest.

Dango Postgres Extras currently handles upserts.

https://django-postgres-extra.readthedocs.io/manager/#upserting

Aug 24 '18 14:08 timmyGr

I like the way Django Postgres Extras handles it, but I believe copying that functionality would require some mayor rewriting of how Django Postgres Copy works.

A very rudimentary solution I propose is passing a list with the following order:

Operation (DO NOTHING or DO UPDATE)
Constraint (the name of the column with a constraint)
Columns to update (in the case of DO UPDATE)

I replaced ignore_conflicts with on_conflict and defaults to an empty list:

def __init__(
    ...
    on_conflict=[],
    ...
):

def insert_suffix(self):
    # If on_conflict is not an empty list
    if self.on_conflict:
        # First item on list - Operation
        if self.on_conflict[0] == 'DO NOTHING':
            return """
                ON CONFLICT DO NOTHING;
            """
        elif self.on_conflict[0] == 'DO UPDATE':
            # Second item on list - Constraint
            constraint = self.on_conflict[1]
            # Delete first two items on list. Only columns to be updated remain
            del self.on_conflict[0:2]
            update_columns =
                ', '.join(["{0} = EXCLUDED.{0}".format(col) for col in self.on_conflict])
            return """
                ON CONFLICT ({0}) DO UPDATE SET {1};
                """.format(constraint, update_columns)
    else:
        return ";"

I tested it out and it seems to work well. I understand it's not the most pythonic solution though

an example would be: Customers.objects.from_csv('/path/to/file.csv', on_conflict=['DO UPDATE', 'name', 'email']) with multiple fields to update: Customers.objects.from_csv('/path/to/file.csv', on_conflict=['DO UPDATE', 'name', 'email', 'active', 'age']) or in plain english: if field on column 'name' already exists, update 'email', 'active' and 'age' fields

Aug 24 '18 18:08 igncampa

Thanks a ton of the snippet. We forked the project and got everything going.

We also added temp_table_name_suffix because this will be used in Airflow with Celery, and also added static_mapping_on_conflict to meet our needs.

Aug 29 '18 19:08 timmyGr

Wow. I'm impressed. You guys are flying by my understanding of ON CONFLICT. I'm open to finding a way to include it in this library, but I'm reluctant to allow raw SQL to be passed into Python keyword arguments.

If it is possible, I'd prefer to adopt a Pythonic system using Django idioms or proven techniques from other Python libraries like psycopg2. If something like that exists and you guys think it will work, let me know.

If not, perhaps in the meantime we could extend the private functions inside this library so that ones that tack together the raw SQL can be extended by creative users such as yourself.

Sep 01 '18 18:09 palewire

Here's a look at how django-postgres-extra handles their INSERT and UPDATE statements. They subclass from Django's SQLCompiler and its derivatives. I have given the SQLCompiler a few reads and I can't quite comprehend how it works. I understand its purpose, I just seem to get lost in the code.

It seems that extending Django's functionality is the best practice, however django-postgres-copy went a different route from the beginning and it would require a major rewrite of the library to subclass from the SQLCompiler.

I'm definitely not an expert in the matter. The above is simply my impression after trying to figure this out.

Sep 25 '18 16:09 igncampa

Any chance this is on the roadmap? Looks like it would be a great addition to the lib.

Jul 23 '22 00:07 geoffrey-eisenbarth

I'm putting together a PR that addresses this as well as #122; I believe my approach is sufficient to avoid concerns with raw SQL being passed as args/kwargs. I've got the core of the code working well although I need to add tests (admittedly, not my strongest suit).

@palewire Since 2.7.2 requires Django 4, does that mean you're okay with Python 3.8 coding, e.g. walrus operator, or would you prefer those left out?

Hoping to submit the PR by Sunday or Monday.

Mar 04 '23 03:03 geoffrey-eisenbarth

This feature would be great for my use case too : keeping imported data up-to-date with the source (daily synchronization).

I would not use a on_conflict argument but just a update_key argument taking the column (or tuple/list of columns) that are part of the unique constraint to use as the update arbiter. We could also just accept True as an argument, in this case the unique constraint would be inferred from Django model (an exception would be raised if there are several unique constraints).

An optional update_columns parameter could take the list of columns to be updated, the default being all columns except static ones.

Mar 07 '23 22:03 ipamo

Or maybe a simple update argument that could accept True (because update_key=True does not make much sense to me).

update_key and update_columns (both taking columns) would then be optional arguments when update=True is specified.

What do you think?

Mar 07 '23 22:03 ipamo

@ipamo What about a simple on_conflict param that defaults to None, but accepts either 'update' or 'ignore' (None would proceed without any ON CONFLICT logic, and 'ignore' would replace the current ignore_conflicts param)?

I designed my original PR to mirror the actual logic of the ON CONFLICT as close as possible (hence passing a dictionary as on_conflict with keys for target and action), but I can see the benefit of simplifying things a bit and detecting any constraints instead of forcing the developer to name them specifically.

@palewire Any thoughts?

Mar 07 '23 22:03 geoffrey-eisenbarth

Using on_conflict as you suggest has the benefit of having a single argument for both cases (ignore and update), which is great, but :

This will make the existing ignore_conflicts obsolete.
It is necessary to specify labels as argument values ('update' or 'ignore'). Just using booleans seems more easier/pythonic.
I'm not sure all users would make a relation between the idea of a conflict and the update. I see the idea of the conflict as Postgres INSERT query "internals" (probably also because of my MSSQL backgrounds where upserts are done with a MERGE command without any idea of a conflict). To me, just using update=True is much better semantically and also easier to remember.

The last point is obviously a quite personal reason but I prefer to state it as others might have the same impression.

Also, I think we need an easy way to specify the key (if there are several unique constraints) and the updated columns. Hence update_key and update_columns (which we could specify as implying update=True).

I also like the idea of static_mapping_on_conflict which I saw above. I would name it static_mapping_on_update (because of the above reasons + it is not used at all in the "ON CONFLICT DO NOTHING" case).

Mar 07 '23 23:03 ipamo

Regarding static_mapping we could also find a way to avoid specifying a second parameter ? That could be good because for many static columns we may want to do the same thing, and we would not easily know/forget whether we have to put the same values in both parameters.

Mar 07 '23 23:03 ipamo

Regarding static_mapping we could also find a way to avoid specifying a second parameter ? That could be good because for many static columns we may want to do the same thing, and we would not easily know/forget whether we have to put the same values in both parameters.

Idea: using suffix in dict key":~ for updates only.

Use case 1:

{
     "created": datetime.now(),
     "updated~": datetime.now(),
}

created would be set only at insertion
updated would be None at insertion

Use case 2:

{
    "mycolumn": "X",
}

Compatible with existing specification: column is inserted with the given value.
Static columns would not be included in the updated columns by default (so no change on updates).
If update_columns include mycolumn, then it is updated with the given static value.

Use case 3:

{
    "mycolumn": "X",
    "mycolumn~": "Y",
}

Column set to X at insertion and Y at update.

Mar 07 '23 23:03 ipamo

Idea: using suffix in dict key:~ for updates only.

Well I realize it's probably too complex. static_mapping_on_update is probably better?

We would have to document clearly that:

Columns in static_mapping are not updated if they are not in static_mapping_on_update or in update_columns.
Columns in static_mapping_on_update are inserted with null/default value if they are not in static_mapping.

Let me know what you think.

Mar 07 '23 23:03 ipamo

I think your concepts dealing with static_mapping might be a bit too complicated (EDIT: but perhaps I just need to review them further, will look into it more when I get a chance).

I feel like using an on_conflict argument might be best this is, after all, a postgres-specific package so it seems appropriate to utilize postgres-specific language even though other databases might use e.g.,"merge" or "update/upsert".

Furthermore, while I understand your hesitation of using labels ('update' or 'ignore'), if we used e.g. booleans, then it would be possible to have update=True, ignore=True, which would be unclear, as opposed to just raising a ValueError if on_conflict is not one of None, ignore, or update.

Mar 08 '23 00:03 geoffrey-eisenbarth

I understand your points too, even if I stay with my personal preference ;)

How would you deal with specifying the update key / columns ?

Mar 08 '23 01:03 ipamo

Well, my PR #156 uses a dictionary such as

on_conflict = {
  'action': 'update',
  'target': <column name or constraint name>,
  'columns': ['update_col_1', 'update_col_2'],
}

This follows the convention for the ON CONFLICT cause outlined here: https://www.postgresql.org/docs/current/sql-insert.html

Mar 08 '23 01:03 geoffrey-eisenbarth