spyglass icon indicating copy to clipboard operation
spyglass copied to clipboard

Proposal for tracking row-based provenance

Open rly opened this issue 5 months ago • 2 comments

For common tables, users may accidentally (or intentionally) modify a row created by another user. So it is useful to know who was the last person who modified a row and when. Similarly, it is useful to know who created the row in the first place.

A couple types of actions (declare table, drop table, delete from table) are tracked in the DataJoint hidden ~log table that is part of each database (e.g., common_nwbfile), but those are limited to table-wide actions and deletions, and it may be hard to parse for everyday users.

There are many ways to do this, but since we care only about the latest state of each row, one common way seems to be to add columns to the data tables about the latest state:

  • created_by or creation_user (user) (with or without IP address)
  • created_on or created_time or creation_timestamp (datetime)
  • last_modified_by (user)
  • last_modified_on or last_modified_time or last_modification_timestamp (datetime)
  • last_modified_reason (string, e.g., "database migration" or "fix typo") (not sure if useful)
  • last_modified_spyglass_version (string) (not sure if useful)
  • (we could also add fields like spyglass_environment - a text-based dump of the virtual environment, host name, and operating system, but I think that is overkill and would result in a bloated database. we have thought about this for NWB and are settling on version of the primary software package as being good enough for our initial attempt at tracking provenance)

This is now doable in the latest version of DataJoint (not released yet?) without cluttering the common display of tables (i.e., the columns are "hidden" but can be queried/displayed).

Alternative: we could create a history table for every data table and add a foreign key from the data table to the history table. I'm not sure if this separation adds any value now that we can have hidden columns.

Concern: Both approaches will increase the size of the database. Is it worthwhile?

It would be nice if this were built into DataJoint as mysql triggers, but until then, we could add the values every time we call populate.

@CBroz1 noted:

I think it would require the mixin intercepting the definition attr before table declaration and appending new lines. There would also be some migration effort of altering all existing tables

Questions:

  1. Is this enhancement worth making? How often are tables modified in a way that it is useful to track who/when/why a change was made?
  2. If yes, what do you think about the approach?

rly avatar Sep 16 '24 20:09 rly