models: decide on an appropriate key
The originally proposed key was model_path, but this column was removed in June 2014.
@mdietze (model_name, revision)?
Yeah, it's probably safe to say that (model_name, revision) should be unique. I am nervous, however, that the rush to add a ton of constraints is going to continue to break the database sync system, in particular as the database expands to more and more modeling teams.
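A composite uniqueness constraint like the one proposed can be sketched as follows. This is Python with an in-memory SQLite database standing in for BETY; the table definition and the model/revision values are illustrative only, not BETY's actual schema or data:

```python
import sqlite3

# Simplified stand-in for the models table; UNIQUE (model_name, revision)
# is the composite key being proposed.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE models (
        id         INTEGER PRIMARY KEY,
        model_name TEXT NOT NULL,
        revision   TEXT NOT NULL,
        UNIQUE (model_name, revision)
    )
""")

conn.execute("INSERT INTO models (model_name, revision) VALUES ('ED2', 'r82')")

# A second row with the same (model_name, revision) pair is rejected outright,
# so the duplicate never enters the database.
rejected = False
try:
    conn.execute("INSERT INTO models (model_name, revision) VALUES ('ED2', 'r82')")
except sqlite3.IntegrityError:
    rejected = True

# A different revision of the same model is still allowed.
conn.execute("INSERT INTO models (model_name, revision) VALUES ('ED2', 'r85')")
```

The point of putting this in the schema rather than in application code is that the check applies to every insert path (web interface, bulk upload, sync) rather than only the ones we remember to guard.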
I appreciate your concern regarding the implementation of constraints.
the rush to add a ton of constraints
This isn't rushed ... we started defining and reviewing constraints last summer. The implementation was delayed until after AGU, so we are trying to move forward along a steady, clearly defined path: define constraints --> prepare database --> implement constraints. If it seems rushed, that may be because of the burst of activity after I asked Scott to create an issue for each constraint that defines the constraint and identifies problems with the existing data. Cleaning these up will take longer.
... is going to continue to break the database sync system,
We can continue to work quickly to fix any errors that occur in the syncing system (e.g., by removing constraints that break the sync). Any suggestions for a better way to prevent / handle such issues in the future are welcome.
However, if the constraints are correctly defined, then most of what we are doing is programmatically implementing our concept of the underlying data model. This process has helped refine the meaning of the schema as well as identify and correct many orphaned, duplicated, and meaningless records - in many cases these errors are the legacy of the original bulk data uploads, as well as the evolving interface.
I think we are doing this in a way that minimizes potential disruptions to syncing, and when disruptions do occur, @gsrohde has responded in a timely fashion.
If there is trouble with syncing the database, then we should come up with another solution - whether that means changing the way we test before deploying, when and where constraints are implemented, etc. In any case, the sooner we implement these, the less work it will be in the future to clean up corrupt data.
in particular as the database expands to more and more modeling teams.
The idea of the constraints is to make the system more robust by integrating mechanisms into the schema that prevent corrupt data from being entered or propagated.
So I actually think this (a growing community of users) is more a motivation for such checks than a reason not to implement them, in the same way that package testing helps define what we intend to do and constantly checks that we are still doing it.
There are two issues when it comes to syncs. The short-term issue is the addition of constraints breaking the sync due to legacy database errors -- this could be resolved by doing sync tests before pushing constraints to the live database.

The longer-term issue is that the constraints being imposed were proposed prior to getting the sync system up and running. There may be constraints that made sense on a single database but no longer make sense if anyone can enter a record on any machine. In that case the conflicts do not appear until the sync is done. This then becomes an impediment to users, and we simply do not have a system in place for resolving these conflicts. For example, if two different users add the same site, whose is the 'correct' record, and who has to change their records? Furthermore, since only the users of a database instance can make changes to that instance, even if 'my' record is the 'correct' one, I can't change the other person's database.

I think the list of constraints needs to be revisited in this light and reduced to the minimum needed. I'd prefer to handle more of the record issues via QA/QC rather than imposing constraints.
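The sync-time conflict described above can be sketched with two in-memory SQLite databases standing in for two independent BETY instances (the sites table and the site name are illustrative, not BETY's actual schema or data):

```python
import sqlite3

def make_instance():
    """One simplified, independent database instance with a uniqueness rule on site name."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sites (id INTEGER PRIMARY KEY, sitename TEXT UNIQUE)")
    return db

# Two users on two machines each add 'the same' site locally. Both inserts
# succeed, because each constraint is only checked within its own database.
instance_a, instance_b = make_instance(), make_instance()
instance_a.execute("INSERT INTO sites (sitename) VALUES ('Example Forest')")
instance_b.execute("INSERT INTO sites (sitename) VALUES ('Example Forest')")

# The conflict only surfaces at sync time, when one instance's records are
# copied into the other -- and neither user can edit the remote record.
conflicts = []
for (sitename,) in instance_b.execute("SELECT sitename FROM sites"):
    try:
        instance_a.execute("INSERT INTO sites (sitename) VALUES (?)", (sitename,))
    except sqlite3.IntegrityError:
        conflicts.append(sitename)
```

Each insert passes its local check, so the constraint gives no early warning; the duplicate is only detected once the records meet, which is exactly when it is hardest to decide whose record wins.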
The "rush" feeling doesn't come from there being a lot of issues posted (I've only seen the ones you've included me on), but from the feeling that those of us at BU and NCSA had to scramble to fix problems created by adding these constraints at a time when we were already overworked to meet our own PEcAn release deadlines in advance of the NACP 2015 meeting. We'd really appreciate some advance warning and better communication on these things (and better testing) so that we're not getting a new BETY version the day before a major meeting. Right now I'm only learning about releases after the fact via completely uninformative Twitter messages such as:
This release primarily implements several foreign-key constraints on the database. It consists of a ne...
This release implements changes to the database schema discussed in Github issue #145 and makes a few r...
This incorporates the features and bug fixes from the June 17, 2014 sprint. It was deployed to www.be...
Constraints that made sense on a single database
I had not thought of this. We should probably have a meeting about these issues and how to resolve them. I don't think that allowing invalid or duplicate records is a good answer. Perhaps duplicates could be fixed on import using the function in #185.
advance warning
Please suggest how you would like this to be communicated. I was unaware of any issues that arose prior to the NACP meeting - or that this meeting was looming (unlike AGU, which was clear).
uninformative twitter messages ...
These are automatically generated at each release (to solve the earlier problem of a delay between release and tweet). We can make them more informative by making sure the first 140 characters are written as a tweet. But they do contain a link to the release notes, which are too extensive to fit in a tweet.
Yes, we should have that conversation
On advance warning, we've been discussing PEcAn development in our weekly Skype telecon, which you are always encouraged to attend so that you know what we're planning and we know what you're planning.
On Twitter, your messages do NOT include links to the release notes
Sorry. I'll fix the tweets.
I didn't realize the weekly telecon had started up again this semester.