ckanext-xloader

Add CKAN Activity for Schema Changes

Open cgoldshtein opened this issue 1 month ago • 4 comments

Request

Create a CKAN activity when _fields_match() detects schema changes.

Why It's Needed

Problem: When users upload files with changed schemas (different columns or data types), the changes happen silently. Dataset managers and API consumers have no way to know their data structure changed without manually checking each file.

Impact:

  • API integrations break unexpectedly when column types change
  • Dataset managers can't track data quality issues
  • No audit trail for schema modifications
  • Users waste time debugging "missing column" errors

How Users Will Use It

Via Activity API:

# List a dataset's activities; schema changes appear with activity_type "changed schema"
GET /api/3/action/package_activity_list?id=dataset-id
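
As a minimal sketch of how an API consumer might use this call, the snippet below fetches a dataset's activity list and keeps only the entries whose activity_type is "changed schema". The site URL and dataset id are placeholders supplied by the caller, and the filtering is done client-side on the assumption that package_activity_list returns every activity for the dataset, not only schema changes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SCHEMA_CHANGE_TYPE = "changed schema"  # activity type proposed in this issue


def filter_schema_changes(activities):
    """Keep only the activities that record a schema change."""
    return [a for a in activities if a.get("activity_type") == SCHEMA_CHANGE_TYPE]


def fetch_schema_changes(site_url, dataset_id):
    """Call package_activity_list and return the schema-change activities."""
    query = urlencode({"id": dataset_id})
    url = f"{site_url.rstrip('/')}/api/3/action/package_activity_list?{query}"
    with urlopen(url) as response:
        payload = json.load(response)
    # CKAN action responses wrap the returned data in a "result" key.
    return filter_schema_changes(payload["result"])
```

A monitoring job could call fetch_schema_changes() on a schedule and alert whenever the returned list grows.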

Use Cases:

  • Monitoring dashboards can alert on schema changes
  • ETL pipelines can detect when to update their processing
  • Dataset managers get notifications via activity feeds
  • API consumers can subscribe to schema change events

Implementation

When: _fields_match() returns NAME_MATCH or MISMATCH

Action: Call activity_create with type "changed schema"

Config: ckanext.xloader.schema_change_activity.enabled = true

Activity Data

Here schema_change_type holds the FieldMatch enum value.

{
  "object_id": "resource_id",
  "activity_type": "changed schema",
  "data": {
    "resource_id": "resource_id",
    "resource_name": "data.csv",
    "schema_change_type": "NAME_MATCH"
  }
}
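
The When / Action / Config rules and the payload above could be combined roughly as follows. This is a sketch under stated assumptions, not xloader's actual code: the FieldMatch class is a stand-in (only NAME_MATCH and MISMATCH are named in this issue; EXACT_MATCH is an assumed "no change" member), the option is assumed to default to off, and the helper names are hypothetical. A real activity_create call is also expected to need a user_id, which the payload above does not show.

```python
from enum import Enum


class FieldMatch(Enum):
    # Stand-in for xloader's FieldMatch enum. EXACT_MATCH is an assumption.
    EXACT_MATCH = "EXACT_MATCH"
    NAME_MATCH = "NAME_MATCH"
    MISMATCH = "MISMATCH"


CONFIG_KEY = "ckanext.xloader.schema_change_activity.enabled"
SCHEMA_CHANGED = {FieldMatch.NAME_MATCH, FieldMatch.MISMATCH}


def is_enabled(config):
    """Read the option as a boolean; treating a missing key as off is an assumption."""
    value = str(config.get(CONFIG_KEY, "false")).strip().lower()
    return value in {"true", "1", "yes", "on"}


def build_schema_change_activity(config, resource, match):
    """Return the dict to hand to activity_create, or None if nothing should be logged."""
    if not is_enabled(config) or match not in SCHEMA_CHANGED:
        return None
    return {
        "object_id": resource["id"],
        "activity_type": "changed schema",
        "data": {
            "resource_id": resource["id"],
            "resource_name": resource.get("name", ""),
            "schema_change_type": match.name,
        },
    }
```

Keeping the decision and the payload in one pure function makes the behaviour easy to unit-test without a running CKAN instance.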

cgoldshtein, Nov 03 '25 13:11

We've successfully implemented this schema change activity feature for our data.gov site and it's working well. Would this be useful as a contribution to the main XLoader extension? We're happy to create a PR if there's interest from other CKAN deployments.

cgoldshtein, Nov 03 '25 13:11

Hi @cgoldshtein,

This may need to be done in two parts: a hook in xloader (and possibly in datapusher too, for parity) and a new extension, with a proposal to include it by default in https://github.com/ckan/ckan/blob/master/ckanext/datastore, since that is the 'public'-facing location for these calls.

@JVickery-TBS has also done work on having xloader hold off until schema validation checks are completed and, if they pass, notifying xloader to replace the datastore; see https://github.com/search?q=repo%3Ackan%2Fckanext-xloader%20validation&type=code

I think both approaches are useful: a 'fixed' dataset is locked to its schema and won't change, while dynamic datasets are allowed to change, with ETL/glue layers accommodating them via the schema change history.

How does that sound?

Regards,

@duttonw

duttonw, Nov 03 '25 21:11

Hi @duttonw,

Great suggestion! I'll implement the two-part approach:

Part 1: I'll add an after_schema_change hook to the IXloader interface, which will be called when _fields_match() detects changes.

Part 2: I'll use my separate extension (that already exists) for activity creation.
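
A rough sketch of how the two parts might fit together, in plain Python so it runs without CKAN. Everything here is illustrative: the after_schema_change signature, the listener classes, and notify_schema_change are assumptions, not the hook as eventually implemented. In a real deployment the listeners would presumably be CKAN plugins implementing IXloader rather than objects in a list.

```python
class SchemaChangeListener:
    """Hypothetical hook interface; the method signature is an assumption."""

    def after_schema_change(self, resource_id, match_type):
        raise NotImplementedError


class ActivityRecorder(SchemaChangeListener):
    """Part 2 stand-in: collects activity dicts instead of calling CKAN."""

    def __init__(self):
        self.activities = []

    def after_schema_change(self, resource_id, match_type):
        self.activities.append({
            "object_id": resource_id,
            "activity_type": "changed schema",
            "data": {
                "resource_id": resource_id,
                "schema_change_type": match_type,
            },
        })


def notify_schema_change(listeners, resource_id, match_type):
    """Part 1 stand-in: what the loader might do after _fields_match()."""
    if match_type not in ("NAME_MATCH", "MISMATCH"):
        return  # schema unchanged, nothing to report
    for listener in listeners:
        listener.after_schema_change(resource_id, match_type)
```

Splitting detection (the hook) from reaction (the activity extension) lets other extensions respond to schema changes in their own way, for example by sending notifications.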

Best regards, @cgoldshtein

cgoldshtein, Nov 04 '25 08:11

Hi @cgoldshtein,

For your new plugin, please borrow https://github.com/ckan/ckanext-xloader/blob/master/.github/workflows/publish.yml and set up PyPI auto-publishing for yourself. You can use an org for the project; when it is public, it works the same as a user account. You can also set up an org inside PyPI (sadly, it may take up to 6 months to get approved) for your project space, so you can assign members instead of individual users per PyPI package. (And no, PyPI does not do namespaces like npmjs does.)

I've kept it minimal, i.e. it does not do auto-increments, which are 'tricky' given all the possible variations that normal PyPI package versioning allows. It does at least ensure that 'tagged' versions must have a matching 'version' inside the project. I wish it were as simple as npm lets it be: https://github.com/qld-gov-au/qgds-bootstrap5/blob/develop/.github/workflows/update.yml#L95
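
The tag-versus-version check described above might look something like this in Python. It is only an illustration of the idea, not what publish.yml actually does: the function name is hypothetical, as is the assumption that tags may carry a "refs/tags/" prefix or a leading "v".

```python
def tag_matches_version(tag, project_version):
    """Return True when a release tag names the same version as the project."""
    normalized = tag.strip()
    # Assumption: the tag may arrive as a full git ref, e.g. "refs/tags/v1.2.0".
    if normalized.startswith("refs/tags/"):
        normalized = normalized[len("refs/tags/"):]
    # Assumption: tags may use a leading "v" that the package version lacks.
    if normalized[:1] in ("v", "V"):
        normalized = normalized[1:]
    return normalized == project_version.strip()
```

A publish job could refuse to upload when this returns False, so a mistagged commit never reaches PyPI.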

duttonw, Nov 04 '25 11:11