Open
MoralCode
opened this issue 3 weeks ago
•
1 comments
Description
This PR resolves and reconciles the differences between the current, in-database data schema and the schema defined by the SQLAlchemy models. This is needed because, for the past few years, both the models and the migrations have had to be made manually, meaning a lack of perfect consistency between updates to the two sources can cause drift between them.
The recent changes in #3416 configured alembic to automatically compare the data models and the deployed schema and generate migrations that represent the differences. This PR reconciles the various differences that have accumulated over the years.
This PR is one step towards, but not yet a complete fix for #3388.
Notes for Reviewers
Most of the schema conflicts were resolved with relatively minor adjustments to the data models (such as specifying names for Foreign keys, adding fields to unique constraints, etc) to bring them in line with what actual deployed databases have.
There were a few places where I thought it was better to update the database instead:
There was an index on contributors.gh_login that I dropped, since I believe gh_login has been superseded by another method of looking up contributors for a while now
a unique constraint on the pull_request_reviews table was updated to add tool_source to its uniqueness check.
the augur_operations.users.email_verified column was dropped since a recent maintainers call confirmed that email verification was a feature that was previously being explored as part of a password reset feature, but was ultimately dropped due to having limited utility (wasnt worth the the additional overhead of configuring sendgrid for email processing compared to a hypothetical future implementation that uses the Augur CLI to enable admins to perform password resets)
Please look over all the changes and let me know if you think any of these changes needs to be reconciled in a different way (i.e. a model change is incorrect and the database needs to change instead)
I have yet to run this on an actual augur install but plan to soon.
This change is a pretty severe blocker for other database changes I would like to make since I want to start from a relatively clean/consistent state when proposing additional migrations (including some that are important for the next release)
Signed commits
[X] Yes, I signed my commits.
Gen AI Disclosure
Generative AI was only used for the planning stages of this changes - for example, looking at the schemas or alembic-generated migrations and helping summarize what the differences were between the models and the database. All of these changes were understood/verified and implemented by me.
OK so it seems like the smoke tests revealed an interesting bug
Stack Trace
```
augur-1 | Traceback (most recent call last):
augur-1 | File "/augur/.venv/bin/augur", line 10, in
augur-1 | sys.exit(run())
augur-1 | ^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 1462, in __call__
augur-1 | return self.main(*args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 1383, in main
augur-1 | rv = self.invoke(ctx)
augur-1 | ^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 1850, in invoke
augur-1 | return _process_result(sub_ctx.command.invoke(sub_ctx))
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 1850, in invoke
augur-1 | return _process_result(sub_ctx.command.invoke(sub_ctx))
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 1246, in invoke
augur-1 | return ctx.invoke(self.callback, **ctx.params)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 814, in invoke
augur-1 | return callback(*args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/decorators.py", line 34, in new_func
augur-1 | return f(get_current_context(), *args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/augur/application/cli/__init__.py", line 45, in new_func
augur-1 | return ctx.invoke(function_internet_connection, *args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 814, in invoke
augur-1 | return callback(*args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/decorators.py", line 34, in new_func
augur-1 | return f(get_current_context(), *args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/augur/application/cli/__init__.py", line 57, in new_func
augur-1 | return ctx.invoke(function_db_connection, *args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 814, in invoke
augur-1 | return callback(*args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/decorators.py", line 34, in new_func
augur-1 | return f(get_current_context(), *args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/augur/application/cli/__init__.py", line 105, in new_func
augur-1 | return ctx.invoke(f, *args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/core.py", line 814, in invoke
augur-1 | return callback(*args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/click/decorators.py", line 34, in new_func
augur-1 | return f(get_current_context(), *args, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/augur/application/cli/config.py", line 105, in init_config
augur-1 | config.load_config_from_dict(augmented_config)
augur-1 | File "/augur/augur/application/config.py", line 316, in load_config_from_dict
augur-1 | self.add_section_from_json(section_name=section_name, json_data=value)
augur-1 | File "/augur/augur/application/config.py", line 282, in add_section_from_json
augur-1 | writeable_config.create_section(section_name, json_data, ignore_existing=True)
augur-1 | File "/augur/augur/application/config.py", line 708, in create_section
augur-1 | self.add_value(section_name, key, value, ignore_existing=ignore_existing)
augur-1 | File "/augur/augur/application/config.py", line 741, in add_value
augur-1 | if not self.has_value(section_name, value_key):
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/augur/application/config.py", line 731, in has_value
augur-1 | query = self.session.query(Config).filter(and_(Config.section_name == section_name,Config.setting_name == value_key) )
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 2910, in query
augur-1 | return self._query_cls(entities, self, **kwargs)
augur-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 276, in __init__
augur-1 | self._set_entities(entities)
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 288, in _set_entities
augur-1 | self._raw_columns = [
augur-1 | ^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 289, in
augur-1 | coercions.expect(
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/sql/coercions.py", line 406, in expect
augur-1 | insp._post_inspect
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/util/langhelpers.py", line 1253, in __get__
augur-1 | obj.__dict__[self.__name__] = result = self.fget(obj)
augur-1 | ^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/mapper.py", line 2707, in _post_inspect
augur-1 | self._check_configure()
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/mapper.py", line 2386, in _check_configure
augur-1 | _configure_registries({self.registry}, cascade=True)
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/mapper.py", line 4199, in _configure_registries
augur-1 | _do_configure_registries(registries, cascade)
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/mapper.py", line 4240, in _do_configure_registries
augur-1 | mapper._post_configure_properties()
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/mapper.py", line 2403, in _post_configure_properties
augur-1 | prop.init()
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/interfaces.py", line 579, in init
augur-1 | self.do_init()
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/relationships.py", line 1637, in do_init
augur-1 | self._setup_join_conditions()
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/relationships.py", line 1882, in _setup_join_conditions
augur-1 | self._join_condition = jc = JoinCondition(
augur-1 | ^^^^^^^^^^^^^^
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/relationships.py", line 2315, in __init__
augur-1 | self._check_foreign_cols(self.primaryjoin, True)
augur-1 | File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/relationships.py", line 2958, in _check_foreign_cols
augur-1 | raise sa_exc.ArgumentError(err)
augur-1 | sqlalchemy.exc.ArgumentError: Could not locate any relevant foreign key columns for primary join condition 'augur_data.commits.cmt_author_platform_username = augur_data.contributors.cntrb_login' on relationship Contributor.commits. Ensure that referencing columns are associated with a ForeignKey or ForeignKeyConstraint, or are annotated in the join condition with the foreign() annotation.
```
Seems like the removal of the foreign key from the DB model is causing issues with the SQLAlchemy model given that the FK's removed from the model (which are not present in any DB i can see as far as I can tell) were in fact being relied on by other parts of the model, specifically the relationship() between commits and contributors that allows an SA model for a contributor to contain a property referencing that persons commits and vise versa.
Hey @MoralCode, thanks for tackling this schema drift cleanup. Found one issue that will likely cause problems:
The Sequence() calls are missing schema='augur_data'. Per SQLAlchemy docs:
When using tables with explicit schema names, the configured schema of the Table is not automatically shared by an embedded Sequence, instead, specify Sequence.schema
The old server_default=text("nextval('augur_data.xxx_seq'::regclass)") worked because it referenced the schema explicitly. The new Sequence('xxx_seq', start=N) will look in public schema (or whatever search_path resolves to) instead of augur_data.
Also, might be worth validating no duplicates exist for gh_login and cntrb_login before the unique constraints get added - especially on systems with manually altered DBs.
2. Also, might be worth validating no duplicates exist for gh_login and cntrb_login before the unique constraints get added - especially on systems with manually altered DBs.
Does alembic have a mechanism to run pre-migration checks like this? (or even just a mechanism to exit early from a migration with an error if the precondition isnt met?) Would be nice to be able to have migrations be either applied or not applied, rather than potentially partially applied
Yeah, you can just run validation queries at the start of upgrade() and raise an exception if preconditions aren't met. Alembic runs migrations in a transaction, so if it fails, the whole thing rolls back. no partial application.
def upgrade():
conn = op.get_bind()
result = conn.execute(text("""
SELECT gh_login, COUNT(*) as cnt FROM augur_data.contributors
WHERE gh_login IS NOT NULL
GROUP BY gh_login HAVING COUNT(*) > 1
LIMIT 5
"""))
dupes = result.fetchall()
if dupes:
raise Exception(f"Cannot apply migration: found duplicate gh_login values: {dupes}")
result = conn.execute(text("""
SELECT cntrb_login, COUNT(*) as cnt FROM augur_data.contributors
WHERE cntrb_login IS NOT NULL
GROUP BY cntrb_login HAVING COUNT(*) > 1
LIMIT 5
"""))
dupes = result.fetchall()
if dupes:
raise Exception(f"Cannot apply migration: found duplicate cntrb_login values: {dupes}")
# Actual migration operations...
The LIMIT 5 is just to keep the error message readable if there are hundreds of duplicates - can drop it or use LIMIT 1 for a simple yes/no check.
The exception aborts the transaction before any schema changes are committed.
@shlokgilda problem: looking into the gh-login table duplication in our instance, im noticing that, at least for this random user i picked, it seems to actually be two separate users (one github, one gitlab) with the same username. It seems like its not possible for these colums to have a unique constraint applied to them.
also: im seeing lots of gh_* prefixed columns being used for gitlab data too...
Ah, that's a data model issue then, not dirty data. If the same username can legitimately exist on both GitHub and GitLab as different people, then gh_login alone can't be unique.
Options:
Composite unique constraint. Make it unique per platform:
UniqueConstraint('gh_login', 'data_source', name='GH-UNIQUE-C', ...)
# Though this assumes data_source reliably distinguishes GitHub vs GitLab contributors.
Drop the unique constraint entirely - if it's not actually enforceable, don't add it. The migration 22 dropped it for a reason (maybe this exact issue?). The FK fk_commits_contributors_3 might still work if you're joining on cntrb_login which could be more reliably unique.
Separate the concerns - the gh_* columns are clearly meant for GitHub data, but if GitLab data is being stuffed in there too, that's a schema design issue worth documenting at minimum.
Re: GitLab data in gh_* columns - yeah that's messy. Probably happened organically when GitLab support was added? Definitely worth discussing.
Worth checking: does the code actually rely on gh_login being unique anywhere? Or was this constraint just assumed/aspirational? If nothing depends on it, simplest path is to just not add it.
3. Separate the concerns - the gh_* columns are clearly meant for GitHub data, but if GitLab data is being stuffed in there too, that's a schema design issue worth documenting at minimum.
already documented in #3469
2. The migration 22 dropped it for a reason (maybe this exact issue?)
Ah yeah thats pretty decent evidence that I resolved the conflict in this PR potentially the wrong way. Will update
ok next issue: foreign key loop in the SPDX schema between package.package_id and package_files
seems like that may be just a warning though. seems interesting that alembic is generating an empty migration rather than saying the database is up to date...