Add CKAN Activity for Schema Changes
Request
Create a CKAN activity when _fields_match() detects schema changes.
Why It's Needed
Problem: When users upload files with changed schemas (different columns, data types), the changes happen silently. Dataset managers and API consumers have no way to know their data structure changed without manually checking each file.
Impact:
- API integrations break unexpectedly when column types change
- Dataset managers can't track data quality issues
- No audit trail for schema modifications
- Users waste time debugging "missing column" errors
How Users Will Use It
Via Activity API:
# Get the activity stream for a dataset (schema changes appear as "changed schema" entries)
GET /api/3/action/package_activity_list?id=dataset-id
Use Cases:
- Monitoring dashboards can alert on schema changes
- ETL pipelines can detect when to update their processing
- Dataset managers get notifications via activity feeds
- API consumers can subscribe to schema change events
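A minimal sketch of how a monitoring script or ETL job might consume this, assuming the activity type is "changed schema" as proposed below. The helper names `filter_schema_changes` and `fetch_schema_changes` and the `ckan_url` parameter are illustrative only and not part of XLoader or CKAN:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Activity type proposed in this issue (an assumption until merged).
SCHEMA_CHANGE_TYPE = "changed schema"


def filter_schema_changes(activities):
    """Return only the activities whose type marks a schema change."""
    return [a for a in activities if a.get("activity_type") == SCHEMA_CHANGE_TYPE]


def fetch_schema_changes(ckan_url, dataset_id):
    """Fetch a dataset's activity stream and keep the schema-change entries."""
    query = urlencode({"id": dataset_id})
    url = f"{ckan_url}/api/3/action/package_activity_list?{query}"
    with urlopen(url) as response:
        payload = json.load(response)
    # CKAN action API responses wrap the data in a "result" key.
    return filter_schema_changes(payload["result"])
```

The filtering is kept separate from the HTTP call so a pipeline can reuse it on activity lists obtained some other way.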
Implementation
When: _fields_match() returns NAME_MATCH or MISMATCH
Action: Call activity_create with type "changed schema"
Config: ckanext.xloader.schema_change_activity.enabled = true
Activity Data
{
  "object_id": "resource_id",
  "activity_type": "changed schema",
  "data": {
    "resource_id": "resource_id",
    "resource_name": "data.csv",
    "schema_change_type": "NAME_MATCH"
  }
}
schema_change_type holds a FieldMatch enum value.
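As a rough sketch of the flow described under Implementation, the payload could be built along these lines. The `FieldMatch` class here is a stand-in so the example runs on its own (the `EXACT_MATCH` member name is an assumption for the no-change case), the helper names are hypothetical, and `create_activity` is any callable that forwards the dict to `activity_create`. The `user_id` field is included on the assumption that `activity_create` requires one:

```python
from enum import Enum


class FieldMatch(Enum):
    """Stand-in for XLoader's FieldMatch enum; EXACT_MATCH is an assumed name."""
    EXACT_MATCH = "EXACT_MATCH"
    NAME_MATCH = "NAME_MATCH"
    MISMATCH = "MISMATCH"


# Results that the proposal treats as a schema change.
CHANGED = {FieldMatch.NAME_MATCH, FieldMatch.MISMATCH}


def build_schema_change_activity(resource, match, user_id):
    """Build the activity dict, or return None when the schema did not change."""
    if match not in CHANGED:
        return None
    return {
        "user_id": user_id,
        "object_id": resource["id"],
        "activity_type": "changed schema",
        "data": {
            "resource_id": resource["id"],
            "resource_name": resource.get("name", ""),
            "schema_change_type": match.name,
        },
    }


def record_schema_change(resource, match, user_id, enabled, create_activity):
    """Create the activity only if the config flag is on and the schema changed."""
    if not enabled:
        return None
    activity = build_schema_change_activity(resource, match, user_id)
    if activity is not None:
        create_activity(activity)
    return activity
```

Here `enabled` would come from the ckanext.xloader.schema_change_activity.enabled setting, so deployments that do not want the extra activities see no behaviour change.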
We've successfully implemented this schema change activity feature for our data.gov site and it's working well. Would this be useful as a contribution to the main XLoader extension? We're happy to create a PR if there's interest from other CKAN deployments.
Hi @cgoldshtein, It may need to be done in two parts: a hook in xloader (and possibly, for parity, in datapusher as well), and a new extension with a proposal to have it included by default in https://github.com/ckan/ckan/blob/master/ckanext/datastore since that is the 'public' facing location for said calls.
There was also work done by @JVickery-TBS on having xloader hold off until schema validation checks are completed and, if they pass, notifying xloader to replace the datastore, i.e. https://github.com/search?q=repo%3Ackan%2Fckanext-xloader%20validation&type=code
I think both approaches are useful: a 'fixed' dataset is locked to its schema and won't change, while dynamic datasets remain possible, with ETL/glue layers accommodating them via the schema change history.
How does that sound?
Regards,
@duttonw
Hi @duttonw,
Great suggestion! I'll implement the two-part approach:
Part 1: I'll add an after_schema_change hook to the IXloader interface that will be called when _fields_match() detects changes.
Part 2: I'll use my separate extension (that already exists) for activity creation.
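A rough sketch of how the two parts might fit together, using plain Python classes in place of the CKAN plugin machinery. The hook signature `(resource_dict, match_type)` and the names `SchemaChangeHook`, `ActivityPlugin`, and `notify_schema_change` are assumptions for illustration, not the final IXloader API:

```python
class SchemaChangeHook:
    """Hypothetical shape of the proposed after_schema_change hook."""

    def after_schema_change(self, resource_dict, match_type):
        """Called after _fields_match() reports a change; default is a no-op."""


class ActivityPlugin(SchemaChangeHook):
    """Stand-in for the separate extension that turns hook calls into activities."""

    def __init__(self):
        self.seen = []

    def after_schema_change(self, resource_dict, match_type):
        # A real plugin would call activity_create here; this one just records.
        self.seen.append((resource_dict["id"], match_type))


def notify_schema_change(plugins, resource_dict, match_type):
    """What the xloader side might do: pass the change to every registered plugin."""
    for plugin in plugins:
        plugin.after_schema_change(resource_dict, match_type)
```

Keeping the default implementation a no-op means existing IXloader plugins would not need to change.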
Best regards, @cgoldshtein
Hi @cgoldshtein,
For your new plugin, please borrow https://github.com/ckan/ckanext-xloader/blob/master/.github/workflows/publish.yml and set up PyPI auto-publishing for yourself. You can use an org for the project; when it is public, this works the same as a user account. You can also set up an org inside PyPI (sadly it may take up to 6 months to get approved) for your project space, so you can assign members instead of individual users per PyPI package. (And no, PyPI does not do namespaces like npmjs does.)
I've kept it minimal, i.e. it does not do auto-increments, which are 'tricky' due to all the possible mutations that normal PyPI package versioning allows. It does at least ensure that 'tagged' versions have a matching 'version' inside the project. I wish it was as simple as npm lets it be: https://github.com/qld-gov-au/qgds-bootstrap5/blob/develop/.github/workflows/update.yml#L95