Unify multiple methods of initializing the database schema
One of the challenges is that we have several ways to initialize the schema and they are all different.
- the .sql file you use here is pretty static/independent, likely somewhat dated, and it also failed to work when I messed with it as part of my testing (the way it populates data from stdin seemed to be failing)
- replaying the schema migrations (what the Augur CLI does) seems to be the main way, but it requires building enough of Augur to access Alembic and related tooling
- creating the schema from the SQLAlchemy models seems to be missing/not creating some required sequences, making it hard to initialize things
I want to get these all in sync so that this testing is less fragile in future
Originally posted by @MoralCode in https://github.com/chaoss/augur/issues/3387#issuecomment-3512811847
I'm told that at one point in time Alembic and SQLAlchemy used to be in sync; I'm unsure when that was, though.
My ideal future - each tool has a dedicated purpose:
- Stop storing raw SQL files in the repo unless they are really small/focused and contain data that's important for a test scenario.
- Use the database model we have already built in SQLAlchemy to do fresh schema initialization. The process that does this should also stamp the schema with the current Alembic HEAD version. This also makes setting up Augur for the first time faster, because there's no need to replay every single migration since the SQLAlchemy schema should always represent the newest version of the schema.
- Use Alembic ONLY for migrating between schema versions, i.e. it should only be used as a bridge to help people with older Augur installs cleanly update their databases to newer schemas.
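The fresh-initialization flow above can be sketched roughly as follows. This is a minimal illustration with a simplified stand-in model (not Augur's real `Repo` model) and an in-memory SQLite engine standing in for Postgres; the Alembic stamp step is shown as a comment because it needs a configured Alembic environment:

```python
from sqlalchemy import BigInteger, Column, MetaData, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

# Simplified stand-in model: a fresh install builds every table straight
# from the SQLAlchemy metadata instead of replaying the migration history.
Base = declarative_base(metadata=MetaData())

class Repo(Base):
    __tablename__ = "repo"
    repo_id = Column(BigInteger, primary_key=True)
    repo_git = Column(String)

engine = create_engine("sqlite://")   # in-memory stand-in for Postgres
Base.metadata.create_all(engine)      # fresh schema, no migration replay
tables = inspect(engine).get_table_names()

# Then mark the database as already being at the newest revision so
# Alembic never tries to replay history (needs a configured Alembic env):
#   from alembic import command
#   from alembic.config import Config
#   command.stamp(Config("alembic.ini"), "head")
```

With this in place, migration replay is only ever needed on databases that predate the current schema.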
Concrete example:
Trying to use `Base.metadata.create_all(engine)` to initialize a DB with SQLAlchemy leads to:
```
E       sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "augur_data.chaoss_metric_status_cms_id_seq" does not exist
E       LINE 3:  cms_id BIGINT DEFAULT nextval('augur_data.chaoss_metric_sta...
E                                      ^
E
E       [SQL:
E       CREATE TABLE augur_data.chaoss_metric_status (
E           cms_id BIGINT DEFAULT nextval('augur_data.chaoss_metric_status_cms_id_seq'::regclass) NOT NULL,
E           cm_group VARCHAR,
E           cm_source VARCHAR,
E           cm_type VARCHAR,
E           cm_backend_status VARCHAR,
E           cm_frontend_status VARCHAR,
E           cm_defined BOOLEAN,
E           cm_api_endpoint_repo VARCHAR,
E           cm_api_endpoint_rg VARCHAR,
E           cm_name VARCHAR,
E           cm_working_group VARCHAR,
E           cm_info JSON,
E           tool_source VARCHAR,
E           tool_version VARCHAR,
E           data_source VARCHAR,
E           data_collection_date TIMESTAMP(0) WITHOUT TIME ZONE DEFAULT CURRENT_TIMESTAMP,
E           cm_working_group_focus_area VARCHAR,
E           PRIMARY KEY (cms_id)
E       )
E
E       ]
E       (Background on this error at: https://sqlalche.me/e/20/f405)

.venv/lib/python3.11/site-packages/sqlalchemy/engine/default.py:922: ProgrammingError
```
The fact that this is the first sequence referenced in the file makes me suspect that sequences are not being properly handled at the moment.
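One likely fix, sketched below under the assumption that the sequence simply isn't declared on the model: attach the sequence explicitly to the column so `create_all()` knows to emit `CREATE SEQUENCE` before the table that references it. The model here is a hypothetical reconstruction from the error above, not Augur's actual `ChaossMetricStatus` definition (only the key column is shown):

```python
from sqlalchemy import BigInteger, Column, MetaData, Sequence
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base
from sqlalchemy.schema import CreateTable

Base = declarative_base(metadata=MetaData(schema="augur_data"))

# Declaring the Sequence object on the column registers it with the
# metadata, so create_all() creates it before the table needs it.
cms_id_seq = Sequence("chaoss_metric_status_cms_id_seq", schema="augur_data")

class ChaossMetricStatus(Base):
    __tablename__ = "chaoss_metric_status"
    cms_id = Column(
        BigInteger,
        cms_id_seq,
        server_default=cms_id_seq.next_value(),
        primary_key=True,
    )

# Compile the Postgres DDL to confirm the default references the sequence.
ddl = str(CreateTable(ChaossMetricStatus.__table__).compile(dialect=postgresql.dialect()))
```

If the model instead relies on the implicit `BIGSERIAL`-style behavior, the default sequence name may just not match what the old SQL file created, which would be a different (naming) fix.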
@MoralCode : The Schema was originally created with that SQL file, and at one moment in our history we started using Alembic to make schema changes. We did not fully reverse engineer the in place schema at that time (~6 years ago) to be built with Alembic scripts. @ABrain7710 and @IsaacMilarky are the ones who implemented Alembic when we moved in that direction.
So, I think my question is: Yes .. "We know" ... and "it works" ... if building the schema from the ground up using Alembic is a goal we want to set, then we can't start from where we are on the scripts. We have to start from what the database "is".
Oh no, I'm not saying that we need to retroactively rebuild the full set of schema migrations in Alembic (that would also be disruptive to anyone currently using basically any version of the Augur schema). I think this is what I was thinking before I fixed some of the Alembic issues, but now that I've worked out some issues with Alembic's automatic migration generation, it's looking a lot more reasonable. I now think we can do this in a relatively painless way.
Because most of the recent changes to the schema in Alembic have happened basically manually, with contributors having to change BOTH the SQLAlchemy model and hand-write an Alembic migration, every migration carries a non-zero chance that something is slightly different between the two.
As a concrete example, in #3397 the topic model meta table had a foreign key referring to the repo ID. In the SQLAlchemy model this was set up as a ForeignKey (correct). However, in the DB migration it was set up as an Integer column (the underlying type for the ForeignKey is actually something like BigInteger). This created a mismatch between the schema as viewed by Alembic and the schema as modeled in SQLAlchemy (and therefore as used by our code).
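The mismatch can be illustrated like this. The table and column names below are simplified reconstructions for the sake of the example, not Augur's actual definitions from #3397:

```python
from sqlalchemy import BigInteger, Column, ForeignKey, Integer, MetaData, Table

metadata = MetaData(schema="augur_data")

repo = Table("repo", metadata,
             Column("repo_id", BigInteger, primary_key=True))

# What the SQLAlchemy model declared: a real FK to repo.repo_id (BigInteger).
model_meta = Table("topic_model_meta", metadata,
                   Column("repo_id", BigInteger,
                          ForeignKey("augur_data.repo.repo_id")))

# What the hand-written migration created: a bare Integer column, so the
# database has no FK constraint and the column types disagree.
migration_meta = Table("topic_model_meta_as_migrated", metadata,
                       Column("repo_id", Integer))

model_has_fk = bool(model_meta.c.repo_id.foreign_keys)
migration_has_fk = bool(migration_meta.c.repo_id.foreign_keys)
```

Alembic's autogenerate compares exactly these two views, which is why drift like this shows up as spurious diffs.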
My goal with this is to bring these two things in sync with each other. Then we can swap to using the DB model objects as the source of truth and use the feature of alembic that can autogenerate migrations with one command when that source of truth changes. This would also allow us to easily initialize new databases (such as for creating blank environments for unit testing).
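Once the models are the source of truth, producing a migration after a model change is a single Alembic command (this assumes a working alembic.ini and env.py pointed at the Augur metadata, and autogenerated scripts should always be reviewed by hand before applying):

```shell
# Compare the live database against the SQLAlchemy metadata and write a
# new migration script containing the detected differences.
alembic revision --autogenerate -m "describe the schema change"

# After reviewing the generated script, apply it.
alembic upgrade head
```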
OK, so here's a somewhat organized view of the differences between the migrations and the models that have accumulated over time.
Note: this excludes downgrades (they are just the reverse of this; we will think about those later) and the topic modeling fixes (that was the bulk of it).
A "create" action here means that something exists in our model but not in the database; a "delete" action means that it exists in the DB but not in the model.
Would love to schedule a time for a more in depth discussion with @sgoggins to decide on each of these based on their merits.
```python
# indexes:
op.drop_index(op.f('gh_login'), table_name='contributors', schema='augur_data')
op.drop_index(op.f('pr_id_pr_files'), table_name='pull_request_files', schema='augur_data')
op.drop_index(op.f('pr_id_pr_reviews'), table_name='pull_request_reviews', schema='augur_data')
op.drop_index(op.f('pr_ID_prs_table'), table_name='pull_requests', schema='augur_data')
op.drop_index(op.f('id_node'), table_name='pull_requests', schema='augur_data')
op.create_index('id_node', 'pull_requests', ['pr_src_id', 'pr_src_node_id'], unique=False, schema='augur_data')

# constraints
op.drop_constraint(op.f('cmt_ght_author_cntrb_id_fk'), 'commits', schema='augur_data', type_='foreignkey')
op.create_unique_constraint('GH-UNIQUE-C', 'contributors', ['gh_login'], deferrable=True, initially='DEFERRED', schema='augur_data')
op.create_unique_constraint('GL-cntrb-LOGIN-UNIQUE', 'contributors', ['cntrb_login'], schema='augur_data')
op.drop_constraint(op.f('pr_events_repo_id_event_src_id_unique'), 'pull_request_events', schema='augur_data', type_='unique')
op.drop_constraint(op.f('pr_review_unique'), 'pull_request_reviews', schema='augur_data', type_='unique')
op.create_unique_constraint(None, 'pull_request_reviews', ['pr_review_src_id', 'tool_source'], schema='augur_data')
op.drop_constraint(op.f('repo_src_id_unique'), 'repo', schema='augur_data', type_='unique')
op.drop_constraint(op.f('deps_scorecard_new_unique'), 'repo_deps_scorecard', schema='augur_data', type_='unique')
op.create_unique_constraint('deps-scorecard-insert-unique', 'repo_deps_scorecard', ['repo_id', 'name'], schema='augur_data')
op.drop_constraint(op.f('user_groups_user_id_name_key'), 'user_groups', schema='augur_operations', type_='unique')
op.create_unique_constraint('user_group_unique', 'user_groups', ['user_id', 'name'], schema='augur_operations')

# foreign keys
op.create_foreign_key('fk_commits_contributors_3', 'commits', 'contributors', ['cmt_author_platform_username'], ['cntrb_login'], source_schema='augur_data', referent_schema='augur_data', onupdate='CASCADE', ondelete='CASCADE', initially='DEFERRED', deferrable=True)
op.create_foreign_key(None, 'commits', 'contributors', ['cmt_ght_author_id'], ['cntrb_id'], source_schema='augur_data', referent_schema='augur_data')

# columns
op.alter_column('releases', 'release_id',
                existing_type=sa.CHAR(length=256),
                type_=sa.CHAR(length=128),
                existing_nullable=False,
                existing_server_default=sa.text("nextval('augur_data.releases_release_id_seq'::regclass)"),
                schema='augur_data')
op.alter_column('user_groups', 'user_id',
                existing_type=sa.INTEGER(),
                nullable=True,
                schema='augur_operations')
op.alter_column('user_session_tokens', 'user_id',
                existing_type=sa.INTEGER(),
                nullable=True,
                schema='augur_operations')
op.alter_column('user_session_tokens', 'application_id',
                existing_type=sa.VARCHAR(),
                nullable=False,
                schema='augur_operations')
op.drop_column('users', 'email_verified', schema='augur_operations')
```
I put slightly more plain-English versions of these in the TSC Google Doc for discussion by maintainers.