DataStore Actions Return Records
feat(logic): datastore returned records;
- Return records from SQL statements.
Proposed fixes:
This records the inserted/updated/deleted rows and returns them in the responses of datastore_create, datastore_upsert, and datastore_delete.
Features:
- [ ] includes tests covering changes
- [ ] includes updated documentation
- [ ] includes user-visible changes
- [ ] includes API changes
- [ ] includes bugfix for possible backport
Please [X] all the boxes above that apply
This would be really useful for ckanext-dsaudit (it's currently missing _id when it isn't passed, and it doesn't record the data after trigger modifications).
Needs some new or modified tests to cover the new parameter.
@wardi okay I made some changes here now for the include_records parameter on all the actions. Will work on some new tests for this.
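For context, a minimal sketch of calling one of the actions with the new flag (the include_records name comes from this PR; the resource id, field, and value are placeholders, not real data):

# Minimal sketch, assuming this PR's include_records parameter: ask
# datastore_create to echo back the rows as stored (generated _id and
# trigger modifications included). The resource id is a placeholder.
import ckan.plugins.toolkit as tk

result = tk.get_action('datastore_create')(
    {},
    {
        'resource_id': 'c69611a8-0514-47f4-a6f7-bc42ba1569f4',
        'fields': [{'id': 'example_text_field', 'type': 'text'}],
        'records': [{'example_text_field': 'WIZZZZZARD'}],
        'include_records': True,
    },
)

print(result['records'])  # rows as stored in the datastore table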
@wardi all that feedback is done now, if you want to take another gander?
With some test coverage this looks great. I'll wait until those are done and discuss this change at the next dev call before merging.
@wardi okay tests are in now (plus a small fix to the upsert "method" handling, and include_records added to the datastore_records_delete schema).
I THINK the one failed test is unrelated? I cannot seem to recreate the test failure locally.
Is this always going to return the final inserted version of the data -- e.g. an upsert with two records containing the same primary key?
Nice catch @EricSoroos. Sending this datastore_upsert request:
{
"resource_id": "c69611a8-0514-47f4-a6f7-bc42ba1569f4",
"records": [
{"prim_id": "1", "example_text_field": "WIZZZZZARD"},
{"prim_id": "1", "example_text_field": "WIZZZZZARD"}
],
"method": "upsert"
}
Returns:
{
"help": "http://127.0.0.1:5009/en/api/3/action/help_show?name=datastore_upsert",
"success": true,
"result": {
"method": "upsert",
"resource_id": "c69611a8-0514-47f4-a6f7-bc42ba1569f4",
"records": [
{
"_id": 69,
"example_text_field": "WIZZZZZARD",
"prim_id": "1"
},
{
"_id": 69,
"example_text_field": "WIZZZZZARD",
"prim_id": "1"
}
]
}
}
which is not really what I want here haha. Okay, I will add a check on the _id in the upsert_data method. I think we can just rely on the _id field as it is the true primary key.
Oh right... we just have to take the LAST upsert/update record for each _id, as that is the one that will be the final DB value. Will add some new tests to cover this as well.
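A minimal sketch of that idea (illustrative names only, not the actual upsert_data code):

# Illustrative only: keep just the last returned row per _id, since that
# is the value left in the table after repeated upserts of the same key.
def last_record_per_id(returned_rows):
    by_id = {}
    for row in returned_rows:
        by_id[row['_id']] = row  # later rows overwrite earlier ones
    return list(by_id.values())


rows = [
    {'_id': 69, 'prim_id': '1', 'example_text_field': 'first write'},
    {'_id': 69, 'prim_id': '1', 'example_text_field': 'WIZZZZZARD'},
]
assert last_record_per_id(rows) == [rows[1]]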
One option for the upsert (which would greatly improve performance) would be to do:
insert into table (fields) values (values)
on conflict (unique_fields_list) do update
set field = excluded.field, field2 = excluded.field2, ...
Then if that was an INSERT ... RETURNING, you'd get back all the info in one shot. You would need to know the unique columns on the table though, since you need them for the ON CONFLICT clause.
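Spelled out slightly more as a sketch (placeholder table and column names and a plain DB-API cursor; not the datastore backend's actual SQL):

# Placeholder names only: one statement that inserts, updates on conflict
# with the unique key, and hands back the row exactly as stored.
UPSERT_SQL = """
    INSERT INTO my_table (prim_id, example_text_field)
    VALUES (%(prim_id)s, %(example_text_field)s)
    ON CONFLICT (prim_id) DO UPDATE
    SET example_text_field = EXCLUDED.example_text_field
    RETURNING *
"""

def upsert_one(cur, record):
    # cur is a psycopg2 / DB-API cursor on the datastore database
    cur.execute(UPSERT_SQL, record)
    return cur.fetchone()  # includes _id and any trigger modifications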
@EricSoroos okay that should be fixed now. I did it in a very very bad way at first, but then imagined what Ian's comment on my 20 nested loops would be, so now it is done in a good-ish Pythonic way.
ohhhh so the INSERT INTO ... ON CONFLICT ... would replace the 2 separate SQL statements that are there now? So it would try the insert first, then update on conflict? Would that work without primary index keys, or would we just use _id if there are no custom primary keys?
INSERT ... ON CONFLICT is PostgreSQL's version of upsert, and it has been in since 9.5. So yes, one command, but you do need to specify which columns you're looking at for the conflict, and what the resolution is. (You can actually do things like ON CONFLICT (field) DO NOTHING, or other options, but that's not really useful for us.)
So you'd have to get the fields with a unique key, which can be done with a query against the pg metadata tables (one is currently used in the datastore pg backend for getting unique indexes, though you can't quite reuse that code):
# check to see if there are any unique indexes
cur.execute(
    sql.SQL("""
        SELECT att.attname
        FROM pg_attribute att, pg_class t, pg_index idx
        WHERE t.oid = idx.indrelid
            AND t.relkind = 'r'
            AND idx.indisunique = 't'
            AND t.relname = %s
            AND att.attnum IN (SELECT unnest(idx.indkey))
            AND att.attrelid = idx.indrelid"""),
    (resource_id,))
unique_columns_list = cur.fetchall()
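And, purely as an illustration (the helper name is made up), those columns could then feed the ON CONFLICT clause:

# Illustrative helper: turn the unique-index columns fetched above into
# an ON CONFLICT target for the upsert statement.
def conflict_target(unique_columns_list):
    # cur.fetchall() returns tuples like [('prim_id',), ...]
    names = [col for (col,) in unique_columns_list]
    return 'ON CONFLICT ({})'.format(', '.join(
        '"{}"'.format(n.replace('"', '""')) for n in names))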
@EricSoroos yeah I think the upsert_data method already gathers all of the unique_keys / primary_keys and uses them in the SQL statements, so it should hopefully be an easy change here. But I will test it out and see what happens.
@EricSoroos okay the SQL statement for the UPSERT method in the upsert_data method has been changed now to a simplified INSERT INTO ... ON CONFLICT ... DO UPDATE statement. Will see how all the tests run here.
Hi @JVickery-TBS I let this slip, do you mind resolving the merge conflicts on this?
@EricSoroos how do you feel about these changes in their current state?
@wardi conflicts resolved. Do you think the deleted records return is still an issue for large datasets? (https://github.com/open-data/ckan/pull/200)
Yes we'll want a hard limit on the records returned from delete because it will consume an arbitrary amount of memory, or time out and fail the operation.
We should also be consistent about returning the records as records from all actions, or change the delete parameter to include_deleted_records if you prefer returning them as deleted_records.
@wardi okay I changed include_records -> include_deleted_records for datastore_delete and datastore_records_delete actions and schemas.
I also updated the code in postgres.py for delete_data to apply the LIMIT to the returning statement. I just used the value of ckan.datastore.search.rows_max, but lemme know if it should just be a hard-coded limit (or a new, different configurable option).
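For illustration, one way such a cap could look on the Python side (a sketch only, not necessarily how the patch applies the limit in SQL; the 32000 fallback mirrors the documented default of ckan.datastore.search.rows_max):

# Sketch only: cap the number of deleted rows carried back to the caller,
# reusing the ckan.datastore.search.rows_max setting as the upper bound.
import ckan.plugins.toolkit as tk

def cap_deleted_records(deleted_rows):
    rows_max = int(tk.config.get('ckan.datastore.search.rows_max') or 32000)
    return deleted_rows[:rows_max]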
@wardi and that is why you need to keep telling me to write tests for everything. When I changed the key in the records_delete schema, I did not add a pop for the new key to the cloned dict in the psql validation method. So that is added now.
Unfortunately, this errors with dsaudit -- with include_records defaulting to false, anything passed through datastore_upsert gets just a bare success response by default and does not include the records in the dictionary.
dsaudit/actions.py unconditionally reads records from the action's return value here: https://github.com/ckan/ckanext-dsaudit/blob/master/ckanext/dsaudit/actions.py#L116
activity_data = {
'fields': [
f for f in srval['fields']
if any(f['id'] in r for r in rval['records'])
],
'records': rval['records'],
'method': rval.get('method', 'upsert'),
'resource_id': res.id,
}
There are a couple of options:
- dsaudit can force the records to be returned (if this is merged??) and then drop them from the response if the caller's flag was false or undefined.
- we can just skip the dsaudit activity if the records aren't returned (roughly as sketched after this list).
- or this patch can return [] for the records to preserve the return type.
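A rough sketch of the second option (names follow the snippet above; pulled into a helper purely for illustration, not the actual dsaudit code):

# Option 2 sketch: if the chained action did not return records
# (include_records defaulted to False), skip building the audit activity
# instead of raising a KeyError.
def build_activity_data(rval, srval, res):
    records = rval.get('records') or []
    if not records:
        return None  # nothing to audit
    return {
        'fields': [
            f for f in srval['fields']
            if any(f['id'] in r for r in records)
        ],
        'records': records,
        'method': rval.get('method', 'upsert'),
        'resource_id': res.id,
    }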
@EricSoroos I've added a fix to dsaudit that should address the incompatibility. If that looks good to you please go ahead and merge these changes.