
DataStore Actions Return Records

Open JVickery-TBS opened this issue 10 months ago • 21 comments

feat(logic): datastore returned records;

  • Return records from SQL statements.

Proposed fixes:

This returns the actual inserted/updated/deleted rows in the records returned from datastore_create, datastore_upsert, and datastore_delete.

Features:

  • [ ] includes tests covering changes
  • [ ] includes updated documentation
  • [ ] includes user-visible changes
  • [ ] includes API changes
  • [ ] includes bugfix for possible backport

Please [X] all the boxes above that apply

JVickery-TBS avatar Feb 21 '25 16:02 JVickery-TBS

This would be really useful for ckanext-dsaudit (currently it's missing _id in cases where it isn't passed, and it doesn't record the data after trigger modifications)

wardi avatar Feb 21 '25 17:02 wardi

Needs some new or modified tests to cover the new parameter.

wardi avatar Feb 21 '25 17:02 wardi

@wardi okay I made some changes here now adding the include_records parameter to all the actions. Will work on some new tests for this
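For illustration, a minimal sketch of a call using the new parameter (the include_records name is from this PR; the URL, token, and field names are placeholders, and ckanapi is just one way to make the call):

# Sketch only: exercises the include_records parameter from this PR.
# URL, API token, resource id, and field names are placeholders.
from ckanapi import RemoteCKAN

ckan = RemoteCKAN('http://127.0.0.1:5000', apikey='my-api-token')
result = ckan.action.datastore_upsert(
    resource_id='c69611a8-0514-47f4-a6f7-bc42ba1569f4',
    records=[{'prim_id': '1', 'example_text_field': 'hello'}],
    method='upsert',
    include_records=True,  # new: echo back the rows as stored
)
# with include_records the response should carry the stored rows,
# including _id and any trigger modifications
print(result['records'])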

JVickery-TBS avatar Feb 21 '25 18:02 JVickery-TBS

@wardi all that feedback is done now, if you want to take another gander?

JVickery-TBS avatar Feb 21 '25 19:02 JVickery-TBS

With some test coverage this looks great. I'll wait until those are done and discuss this change at the next dev call before merging.

wardi avatar Feb 21 '25 21:02 wardi

@wardi okay tests are in now (plus a small fix to the upsert "method", and include_records added to the datastore_records_delete schema).

I THINK the one failed test is unrelated? I cannot seem to recreate the test failure locally.

JVickery-TBS avatar Feb 24 '25 18:02 JVickery-TBS

Is this always going to return the final inserted version of the data -- e.g. an upsert with two records containing the same primary key?

EricSoroos avatar Feb 25 '25 13:02 EricSoroos

Nice catch @EricSoroos

{
    "resource_id": "c69611a8-0514-47f4-a6f7-bc42ba1569f4",
    "records": [
        {"prim_id": "1", "example_text_field": "WIZZZZZARD"},
        {"prim_id": "1", "example_text_field": "WIZZZZZARD"}
    ],
    "method": "upsert"
}

Returns:

{
    "help": "http://127.0.0.1:5009/en/api/3/action/help_show?name=datastore_upsert",
    "success": true,
    "result": {
        "method": "upsert",
        "resource_id": "c69611a8-0514-47f4-a6f7-bc42ba1569f4",
        "records": [
            {
                "_id": 69,
                "example_text_field": "WIZZZZZARD",
                "prim_id": "1"
            },
            {
                "_id": 69,
                "example_text_field": "WIZZZZZARD",
                "prim_id": "1"
            }
        ]
    }
}

which is not really what I want here haha. okay, will add a check on the _id in the upsert_data method. I think we can just rely on the _id field as it is the true primary key

JVickery-TBS avatar Feb 25 '25 14:02 JVickery-TBS

oh right...we have to just take the LAST upsert/update record as that is the one that will be the final DB value. Will add some new tests to cover this as well
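For illustration, a minimal sketch of that keep-the-last idea, keyed on _id (the helper name is hypothetical, not the actual upsert_data code):

# Hypothetical helper: keep only the last occurrence of each _id,
# matching the final DB value after duplicate upserts.
def dedupe_keep_last(rows):
    by_id = {}
    for row in rows:  # later rows overwrite earlier ones with the same _id
        by_id[row['_id']] = row
    return list(by_id.values())

rows = [
    {'_id': 69, 'prim_id': '1', 'example_text_field': 'old value'},
    {'_id': 69, 'prim_id': '1', 'example_text_field': 'WIZZZZZARD'},
]
print(dedupe_keep_last(rows))
# [{'_id': 69, 'prim_id': '1', 'example_text_field': 'WIZZZZZARD'}]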

JVickery-TBS avatar Feb 25 '25 15:02 JVickery-TBS

One option for the upsert (which would greatly improve the performance) would be to do:

insert into table (fields) values (values)
on conflict (unique_fields_list) do
update set field = excluded.field, field2 = excluded.field2, ...
returning *

Then, since that is an INSERT ... RETURNING, you'd get back all the info in one shot. You would need to know the unique columns on the table though, since you need those for the ON CONFLICT clause.

EricSoroos avatar Feb 25 '25 15:02 EricSoroos

@EricSoroos okay that should be fixed now. I did it in a very, very bad way at first, but then imagined what Ian's comment on my 20 nested loops would be, so now it is done in a good-ish Pythonic way.

ohhhh so the INSERT INTO ... ON CONFLICT ... would replace the 2 separate SQL statements that are there now? so it would try insertion first, then update on conflict? would that work without primary index keys, or would we just use _id if there are no custom primary keys?

JVickery-TBS avatar Feb 25 '25 15:02 JVickery-TBS

INSERT ... ON CONFLICT is PostgreSQL's version of upsert, and it's been in since 9.5. So yes, one command, but you do need to specify which columns you're looking at for the conflict, and what the resolution is. (You can actually do things like ON CONFLICT (field) DO NOTHING, or other options, but that's not really useful for us)

So you'd have to get the fields with a unique key, which can be done with a query against the pg metadata tables (a similar query is currently used in the datastore pg backend for getting unique indexes, though you can't quite reuse that code):

# check to see if there are any unique indexes on the table
# (cur is a psycopg2 cursor; sql is psycopg2's sql module:
#  from psycopg2 import sql)
cur.execute(
    sql.SQL("""SELECT att.attname
               FROM pg_attribute att,
                    pg_class t,
                    pg_index idx
               WHERE t.oid = idx.indrelid
                 AND t.relkind = 'r'
                 AND idx.indisunique = 't'
                 AND t.relname = %s
                 AND att.attnum IN (SELECT unnest(idx.indkey))
                 AND att.attrelid = idx.indrelid"""),
    (resource_id,))
unique_columns_list = cur.fetchall()
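Continuing from the query above, a hedged sketch of how those unique columns could feed the ON CONFLICT clause using psycopg2's sql composition (field names are placeholders; this is not the actual upsert_data change):

# Sketch only: build INSERT ... ON CONFLICT ... DO UPDATE ... RETURNING
# from the unique columns found above. Field names are placeholders.
from psycopg2 import sql

unique_columns = [row[0] for row in unique_columns_list]
field_names = ['prim_id', 'example_text_field']

stmt = sql.SQL(
    'INSERT INTO {table} ({fields}) VALUES ({values}) '
    'ON CONFLICT ({conflict}) DO UPDATE SET {updates} RETURNING *'
).format(
    table=sql.Identifier(resource_id),
    fields=sql.SQL(', ').join(map(sql.Identifier, field_names)),
    values=sql.SQL(', ').join(sql.Placeholder() * len(field_names)),
    conflict=sql.SQL(', ').join(map(sql.Identifier, unique_columns)),
    updates=sql.SQL(', ').join(
        sql.SQL('{col} = excluded.{col}').format(col=sql.Identifier(col))
        for col in field_names
    ),
)
cur.execute(stmt, ['1', 'WIZZZZZARD'])
returned_rows = cur.fetchall()  # the rows as actually stored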

EricSoroos avatar Feb 25 '25 16:02 EricSoroos

@EricSoroos yeah I think the upsert_data method already gathers all of the unique_keys / primary_keys and uses them in the SQL statements, so it should hopefully be an easy change here. But will test it out and see what happens

JVickery-TBS avatar Feb 25 '25 16:02 JVickery-TBS

@EricSoroos okay the SQL statement for the UPSERT method in the upsert_data method has now been changed to a simplified INSERT INTO ... ON CONFLICT ... DO UPDATE statement. Will see how all the tests run here.

JVickery-TBS avatar Feb 25 '25 16:02 JVickery-TBS

Hi @JVickery-TBS I let this slip, do you mind resolving the merge conflicts on this?

@EricSoroos how do you feel about these changes in their current state?

wardi avatar May 28 '25 20:05 wardi

@wardi conflicts resolved. Do you think there is still an issue with returning deleted records for large datasets? (https://github.com/open-data/ckan/pull/200)

JVickery-TBS avatar Jun 03 '25 14:06 JVickery-TBS

Yes, we'll want a hard limit on the records returned from delete, because otherwise it could consume an arbitrary amount of memory, or time out and fail the operation.

We should also be consistent about returning the records as records from all actions, or change the delete parameter to include_deleted_records if you prefer returning them as deleted_records.

wardi avatar Jun 03 '25 14:06 wardi

@wardi okay I changed include_records -> include_deleted_records for datastore_delete and datastore_records_delete actions and schemas.

I also updated the delete_data code in postgres.py to apply a LIMIT to the rows in the returning statement. I just used the value of ckan.datastore.search.rows_max, but lemme know if it should be a hard-coded limit (or a new, separate configurable option) instead.
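For illustration, a sketch of one way such a cap can work: DELETE ... RETURNING cannot take a LIMIT directly in PostgreSQL, but a data-modifying CTE can delete everything that matches while capping only the echoed rows (table and column names are placeholders; the actual delete_data change may differ):

# Sketch only: cap the echoed rows, not the delete itself. rows_max
# stands in for the configured ckan.datastore.search.rows_max value.
rows_max = 32000

delete_returning_sql = f'''
    WITH deleted AS (
        DELETE FROM "my_resource_table"
        WHERE "prim_id" = %s
        RETURNING *
    )
    SELECT * FROM deleted
    LIMIT {rows_max}
'''
# cur.execute(delete_returning_sql, ('1',))
# returned = cur.fetchall()  # at most rows_max of the deleted rows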

JVickery-TBS avatar Jun 03 '25 16:06 JVickery-TBS

@wardi and that is why you need to keep telling me to write tests for everything. When I changed the key in the records_delete schema, I did not add a pop to the cloned dict in the psql validation method. So that is added now

JVickery-TBS avatar Aug 20 '25 20:08 JVickery-TBS

Unfortunately, this errors with dsaudit -- defaulting include_records to false means that by default, anything passed through datastore_upsert gets back a bare success response that does not include the records in the dictionary.

dsaudit/actions.py unconditionally reads the records key here: https://github.com/ckan/ckanext-dsaudit/blob/master/ckanext/dsaudit/actions.py#L116

# from ckanext-dsaudit: rval['records'] raises a KeyError when the
# datastore_upsert response does not include records
activity_data = {
    'fields': [
        f for f in srval['fields']
        if any(f['id'] in r for r in rval['records'])
    ],
    'records': rval['records'],
    'method': rval.get('method', 'upsert'),
    'resource_id': res.id,
}

There are a couple of options:

  1. dsaudit can force return values (if this is merged??) and then drop them from the response if they would otherwise be false or undefined.
  2. we can just skip the dsaudit activity if the records aren't in the response.
  3. or this patch can return [] for the records to preserve the return type.
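For illustration, a minimal sketch of option 2 (the function name is hypothetical; this is not the actual fix):

# Hypothetical shape of option 2 inside dsaudit's handler: skip the
# audit activity when the response carries no records.
def build_activity_data(rval, srval, res):
    records = rval.get('records')
    if not records:
        return None  # bare success response: nothing to audit
    return {
        'fields': [
            f for f in srval['fields']
            if any(f['id'] in r for r in records)
        ],
        'records': records,
        'method': rval.get('method', 'upsert'),
        'resource_id': res.id,
    }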

EricSoroos avatar Nov 12 '25 18:11 EricSoroos

@EricSoroos I've added a fix to dsaudit that should address the incompatibility. If that looks good to you please go ahead and merge these changes.

wardi avatar Nov 28 '25 21:11 wardi