
Use `alembic` to manage SQLite schema changes?

Open widdowquinn opened this issue 2 years ago • 41 comments

Summary:

Use alembic or similar to manage database migration.

Description:

In discussions with users, modifications to the SQLite database schema have caused issues where updating their v0.3 installation has led to incompatibility with an existing database, and resulting errors.

We could potentially avoid this by carefully versioning the SQLite/SQLAlchemy schema and using alembic or similar to aid migration between versions in most cases.

If users maintain large analyses in their databases, we should avoid discouraging them from trusting the package by breaking backwards-compatibility.

Having used this before with Flask applications, I'm confident we could usually achieve seamless in-place upgrades of users' existing local databases.
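As a rough sketch of what a migration script for this would look like (the column names anticipate the fastANI schema changes discussed in this thread; the revision ids are placeholders that alembic would generate), each revision pairs an upgrade() with a matching downgrade():

```python
"""Illustrative alembic migration file; revision ids are placeholders."""
from alembic import op
import sqlalchemy as sa

revision = "000000000001"  # placeholder; alembic generates real ids
down_revision = None       # first revision in the chain


def upgrade():
    # move an old-schema database forward to the newer schema
    op.add_column("comparisons", sa.Column("kmersize", sa.Integer))
    op.add_column("comparisons", sa.Column("minmatch", sa.Float))


def downgrade():
    # revert a new-schema database to the older schema
    op.drop_column("comparisons", "kmersize")
    op.drop_column("comparisons", "minmatch")
```

Users would then run a single upgrade command and alembic would walk the revision chain for them.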

widdowquinn avatar Jan 28 '22 11:01 widdowquinn

Can you share any of these discussions with users? That would give me a more concrete idea of the problem this solves. I imagine it's the same issue I had between the fastANI branch and others, though.

baileythegreen avatar Jan 28 '22 17:01 baileythegreen

You're the user with the most experience of these incompatibilities, for sure ;)

We have talked about using alembic for this before, and it's good practice in any case.

I don't have much concrete, but here's a related screenshot as I helped Angelika understand her issue with the update:

image

widdowquinn avatar Jan 29 '22 11:01 widdowquinn

I've been playing with this, and there are a few things I don't seem to have figured out, yet.

  • revision naming, and specifying a specific revision from the command line, vs just saying alembic upgrade head
  • where migration files need to exist in a directory hierarchy; do they go in multidb?
  • allowing multiple databases; I think I am halfway towards this, but something seems incorrect
  • how much of the Q&A material that talks about Flask is going to be relevant? I'm seeing lots of Flask content, despite not searching for it
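On the first point, as I understand it the CLI takes either head or an explicit revision id, and the same operations are also exposed programmatically via alembic.command (a sketch, assuming a standard single-db alembic.ini; the revision id below is a placeholder):

```python
from alembic import command
from alembic.config import Config

cfg = Config("alembic.ini")           # path to the ini file
command.upgrade(cfg, "head")          # latest revision
command.upgrade(cfg, "65538af5a5e1")  # or a specific revision id
command.downgrade(cfg, "-1")          # relative: step back one revision
command.downgrade(cfg, "base")        # all the way back down
```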

BUT, I think I have a basic migration script written for the old/new versions of the database (pre-/post- fastANI addition).

baileythegreen avatar Feb 01 '22 05:02 baileythegreen

I have set up the basic requirements for a new versiondb subcommand, and also managed to get alembic working correctly locally. I'm now working on tying the functionality into pyani in a sensible way.

I was going to add a tag in the repo, prior to the merge of fastANI, to capture the database changes, but it looks like tags are directly related to releases, and doing so might be confusing, since the database is only a factor in version 3, and all of the existing releases are version 2.

baileythegreen avatar Feb 04 '22 18:02 baileythegreen

Actually, I have found an issue with the downgrade() function I am testing, which should remove the columns that are added to support fastANI.

The function I am running is written in accordance with examples I've seen:

def downgrade_old_db():
    with op.batch_alter_table("comparisons") as batch_op:
        batch_op.drop_column('kmersize')
        batch_op.drop_column('minmatch')

but nothing happens—note the lack of a 'Running downgrade ...' message here:

(pyani_dev) ! alembic downgrade base
INFO  [alembic.env] Migrating database old_db
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.

Many sources on the internet claim that this is because sqlite does not support the DROP COLUMN functionality of ALTER TABLE, but the documentation for this indicates otherwise. It appears this change was made sometime prior to July 2021, though I have not been able to pinpoint exactly when.

I have, however, confirmed that I can drop columns within my sqlite installation, so I'm puzzled as to why alembic is not managing to do so.

sqlite> .schema comparisons
CREATE TABLE comparisons (
	comparison_id INTEGER NOT NULL, 
	query_id INTEGER NOT NULL, 
	subject_id INTEGER NOT NULL, 
	aln_length INTEGER, 
	sim_errs INTEGER, 
	identity FLOAT, 
	cov_query FLOAT, 
	cov_subject FLOAT, 
	program VARCHAR, 
	version VARCHAR, 
	fragsize INTEGER, 
	maxmatch BOOLEAN, kmersize INTEGER, minmatch FLOAT, 
	PRIMARY KEY (comparison_id), 
	UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch), 
	FOREIGN KEY(query_id) REFERENCES genomes (genome_id), 
	FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);
sqlite> alter table comparisons drop kmersize;
sqlite> .schema comparisons
CREATE TABLE comparisons (
	comparison_id INTEGER NOT NULL, 
	query_id INTEGER NOT NULL, 
	subject_id INTEGER NOT NULL, 
	aln_length INTEGER, 
	sim_errs INTEGER, 
	identity FLOAT, 
	cov_query FLOAT, 
	cov_subject FLOAT, 
	program VARCHAR, 
	version VARCHAR, 
	fragsize INTEGER, 
	maxmatch BOOLEAN, minmatch FLOAT, 
	PRIMARY KEY (comparison_id), 
	UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch), 
	FOREIGN KEY(query_id) REFERENCES genomes (genome_id), 
	FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);
sqlite> 
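One possible wrinkle: the same check can be scripted from Python's sqlite3 module, and the SQLite library linked into Python is not necessarily the same build as the sqlite3 CLI on PATH; DROP COLUMN support arrived in SQLite 3.35.0 (2021-03-12). A small sketch to check the linked library directly:

```python
import sqlite3


# ALTER TABLE ... DROP COLUMN needs SQLite >= 3.35.0 (released 2021-03-12);
# the library linked into Python may be older than the sqlite3 CLI.
def can_drop_column() -> bool:
    return tuple(map(int, sqlite3.sqlite_version.split("."))) >= (3, 35, 0)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comparisons (comparison_id INTEGER, kmersize INTEGER)")
if can_drop_column():
    conn.execute("ALTER TABLE comparisons DROP COLUMN kmersize")
columns = [row[1] for row in conn.execute("PRAGMA table_info(comparisons)")]
```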

baileythegreen avatar Feb 04 '22 19:02 baileythegreen

it looks like tags are directly related to releases, and doing so might be confusing, since the database is only a factor in version 3, and all of the existing releases are version 2.

IIRC, git tags are just labels (though really, only lightweight tags are just labels - annotated tags are proper git objects; both point to specific commits) - but GitHub chooses to interpret tags in a specific way. Each tag is understood to be a release point, but the release point only becomes an actual release if Release Notes are added. GitHub's approach to this is not especially general, IMO.

widdowquinn avatar Feb 07 '22 15:02 widdowquinn

Actually, I have found an issue with the downgrade() function I am testing, which should remove the columns that are added to support fastANI.

The function I am running is written in accordance with examples I've seen:

def downgrade_old_db():
    with op.batch_alter_table("comparisons") as batch_op:
        batch_op.drop_column('kmersize')
        batch_op.drop_column('minmatch')

but nothing happens—note the lack of a 'Running downgrade ...' message here:

(pyani_dev) ! alembic downgrade base
INFO  [alembic.env] Migrating database old_db
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.

Does alembic provide no error message or other useful information (and appear to have "worked" without DROPping the column)?

A quick search turned up at least one workaround someone implemented, but we shouldn't have to do that.

I've seen two mentions of a render_as_batch option which, I think, allows alembic to batch copy the table, modify it accordingly, DROP the old one, and rename the new one:

Miguel Grinberg has a blog post outlining a possible solution.
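For reference, the render_as_batch option goes into the context.configure() call in env.py (a fragment following Grinberg's recipe, untested against our setup):

```python
# inside run_migrations_online() in env.py
with connectable.connect() as connection:
    context.configure(
        connection=connection,
        target_metadata=target_metadata,
        render_as_batch=True,  # emit SQLite ALTERs as copy-and-move batches
    )
    with context.begin_transaction():
        context.run_migrations()
```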

I expect you'll have seen and tried all of these, mind.

widdowquinn avatar Feb 07 '22 15:02 widdowquinn

Does alembic provide no error message or other useful information (and appear to have "worked" without DROPping the column)?

By default it does not produce any error message, and I have tried including the --raiseerr optional flag to get a full traceback, but it still doesn't seem to output anything.

I was attempting to verify that the issue was with dropping columns, and not with downgrading in general, by creating a second revision so I had more things to step through. However, now I am stuck in a loop where alembic claims the database is not up-to-date, but the fastani columns exist. The issue seems to be that it's not populating the alembic_version table now, so the upgrade fails partway through, but I don't know why. I think I saw something about this the other day, but haven't managed to find it again.

I think this issue started when I manually deleted columns, to prove I could, though I've since recreated the database. Perhaps something is just not hooked up quite right....

baileythegreen avatar Feb 07 '22 17:02 baileythegreen

I seem to have hacked my way through a series of arcane errors I could find little information about to something that works.

I now have two migration scripts that I am able to use to both upgrade and downgrade a database reliably. The differences from what I had before are that I did not use the multidb template this time, as it did not seem necessary after further reflection, and that I have had to switch from performing the batch operations within a context:

with op.batch_alter_table("comparisons") as batch_op:
    batch_op.add_column(sa.Column('kmersize', sa.Integer))
    batch_op.add_column(sa.Column('minmatch', sa.Float))

to doing so outside of one:

op.add_column('comparisons', sa.Column('kmersize', sa.Integer))
op.add_column('comparisons', sa.Column('minmatch', sa.Float))

This change was necessary because I got the following error, and nothing else both prevents the error, and still performs the changes. I have not been able to figure out why this caused an error when, for instance, I successfully used the former format yesterday.

Traceback (most recent call last):
  File "/Users/baileythegreen/Software/miniconda3/bin/alembic", line 8, in <module>
    sys.exit(main())
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/config.py", line 588, in main
    CommandLine(prog=prog).main(argv=argv)
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/config.py", line 582, in main
    self.run_cmd(cfg, options)
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/config.py", line 559, in run_cmd
    fn(
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/command.py", line 320, in upgrade
    script.run_env()
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/script/base.py", line 563, in run_env
    util.load_python_file(self.dir, "env.py")
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 92, in load_python_file
    module = load_module_py(module_id, path)
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 108, in load_module_py
    spec.loader.exec_module(module)  # type: ignore
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/baileythegreen/Software/pyani/scratch/alembic/env.py", line 72, in <module>
    run_migrations_online()
  File "/Users/baileythegreen/Software/pyani/scratch/alembic/env.py", line 66, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/runtime/environment.py", line 851, in run_migrations
    self.get_context().run_migrations(**kw)
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/runtime/migration.py", line 620, in run_migrations
    step.migration_fn(**kw)
  File "/Users/baileythegreen/Software/pyani/scratch/alembic/versions/65538af5a5e1_add_nonsense_column.py", line 21, in upgrade
    batch_op.add_column(sa.Column('nonsense', sa.Integer))
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/operations/base.py", line 374, in batch_alter_table
    impl.flush()
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/operations/batch.py", line 101, in flush
    should_recreate = self._should_recreate()
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/operations/batch.py", line 94, in _should_recreate
    return self.operations.impl.requires_recreate_in_batch(self)
  File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/ddl/sqlite.py", line 56, in requires_recreate_in_batch
    and col.server_default.persisted
AttributeError: 'NoneType' object has no attribute 'persisted'

baileythegreen avatar Feb 08 '22 00:02 baileythegreen

I have no idea how this manages to get around the issue I saw before when nothing seemed to happen on downgrade.

baileythegreen avatar Feb 08 '22 00:02 baileythegreen

Does this seem like the right way to incorporate alembic with pyani? I know less about the nuances of packaging things in Python.

https://github.com/sqlalchemy/alembic/issues/463#issuecomment-441887412

baileythegreen avatar Feb 08 '22 18:02 baileythegreen

I'm not familiar enough with alembic to know whether it's the "correct" way to proceed. It looks to have been part of alembic since 1.2.0, but isn't explicitly documented or in the tutorial AFAICT. So I presume it's stable, but maybe a niche operation that doesn't get stress-tested a lot.

What I get from the thread linked above seems sensible. If it does what we need, doesn't break anything else, and doesn't impose any outrageous requirements, that's good enough for me.

widdowquinn avatar Feb 09 '22 09:02 widdowquinn

I meant more from a Python packaging standpoint. All this seems to do is create two empty __init__.py files—one in the alembic (or other) directory specified as the location for alembic's stuff; the other in the versions subdirectory of the former. This seems unlikely to break anything, but may or may not be necessary.

I think the files that need to be included/given to users, and the correct structure for them is:

-  alembic.ini
-  alembic/
    - __init__.py
    - env.py
    - README           # contains a single line stating what alembic template was used; may not be useful
    - script.py.mako   # this may not be necessary; it is a template for creating new migration files
    - versions/
       - __init__.py
       - <revision_file>
       - <revision_file>

I have a subcommand working that will run upgrades and downgrades, and capture the output/errors from those. The main questions now are:

  • How should these files be included/given to the user?

    • Including them at the top level seems sensible to me, but for users who install via conda, I don't know if something else would be better.
  • Is there a way to set the sqlalchemy.url variable programmatically?

    • This is currently the only change that has to be made in the alembic.ini file in order for things to work, but it needs to match whatever database the user is using; for distribution we could use the default database name, but I would like to find a way to pull it from the command line arguments, or something. Ideally, without using sed, or a vimscript, to edit it.

baileythegreen avatar Feb 09 '22 15:02 baileythegreen

All this seems to do is create two empty __init__.py files—one in the alembic (or other) directory specified as the location for alembic's stuff; the other in the versions subdirectory of the former.

That doesn't seem unreasonable but, as I understand it, the __init__.py may no longer be necessary with namespace packages. As pyani isn't supported for Python versions that don't support namespace packaging, we'd likely be OK without it.

widdowquinn avatar Feb 09 '22 16:02 widdowquinn

How should these files be included/given to the user? Including them at the top level seems sensible to me, but for users who install via conda, I don't know if something else would be better.

I imagined that there would be a way to present only the pyani upgrade/pyani downgrade command to the user and have it work. Is that not possible?

widdowquinn avatar Feb 09 '22 16:02 widdowquinn

Is there a way to set the sqlalchemy.url variable programmatically?

I don't really know what this refers to, or should point to. I would have assumed the up/downgrade could run with a default --dbpath argument like the other commands, or take a passed path.

Is this issue relevant? https://github.com/sqlalchemy/alembic/issues/606

widdowquinn avatar Feb 09 '22 16:02 widdowquinn

For alembic to work, it needs a config file: alembic.ini, which contains the database name/driver info in the form of an sqlalchemy.url variable, and the location of the migration scripts as another variable. The sqlalchemy.url variable is pulled from the config file when alembic's env.py is run; I am not yet sure where the script location variable is retrieved. There does not seem to be a general way to pass these values inside an alembic command so that I can just pull them from our own command line options. It is possible to give alembic a database name this way, but I believe this only works in cases where the config file uses the multidb template, and allows you to selectively upgrade/downgrade a single database.

I imagined that there would be a way to present only the pyani upgrade/pyani downgrade command to the user and have it work. Is that not possible?

Right now, pyani versiondb [--upgrade <revision>] and pyani versiondb --downgrade <revision> work, but the location of the database, and the driver used, as well as the location of the migration scripts still have to be specified somewhere.

This is what I mean by specifying the driver & database location:

Valid SQLite URL forms are:
sqlite:///:memory: (or, sqlite://)
sqlite:///relative/path/to/file.db
sqlite:////absolute/path/to/file.db

The sqlite part should be fine, but the path is obviously something we can't predict, and the only way I currently know to get the value of something like --dbpath to the right bit of code is through an environment variable, which I would have to set, because I don't think alembic's env.py will have access to our namespace variables.
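For what it's worth, a sketch of how env.py could consume such a variable (PYANI_DATABASE is a hypothetical name, and this fragment is untested against our setup):

```python
# near the top of env.py, after `config = context.config`
import os

dbpath = os.environ.get("PYANI_DATABASE")  # hypothetical variable name
if dbpath:
    # an absolute path after sqlite:/// yields the sqlite:////abs/path form
    config.set_main_option("sqlalchemy.url", f"sqlite:///{os.path.abspath(dbpath)}")
```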

Is this issue relevant? sqlalchemy/alembic#606

This looks like they're using an environmental variable, so maybe. If you think that's an acceptable way forward. I still need to do some digging to see where script location stuff is used.

baileythegreen avatar Feb 09 '22 17:02 baileythegreen

Examples of my current alembic.ini and env.py. The only modifications I have made are to the sqlalchemy.url line(s) in the first one.

alembic.ini:

# A generic, single database configuration.

[alembic]
# path to migration scripts
script_location = alembic

# template used to generate migration files
# file_template = %%(rev)s_%%(slug)s

# sys.path path, will be prepended to sys.path if present.
# defaults to the current working directory.
prepend_sys_path = .

# timezone to use when rendering the date within the migration file
# as well as the filename.
# If specified, requires the python-dateutil library that can be
# installed by adding `alembic[tz]` to the pip requirements
# string value is passed to dateutil.tz.gettz()
# leave blank for localtime
# timezone =

# max length of characters to apply to the
# "slug" field
# truncate_slug_length = 40

# set to 'true' to run the environment during
# the 'revision' command, regardless of autogenerate
# revision_environment = false

# set to 'true' to allow .pyc and .pyo files without
# a source .py file to be detected as revisions in the
# versions/ directory
# sourceless = false

# version location specification; This defaults
# to alembic/versions.  When using multiple version
# directories, initial revisions must be specified with --version-path.
# The path separator used here should be the separator specified by "version_path_separator"
# version_locations = %(here)s/bar:%(here)s/bat:alembic/versions

# version path separator; As mentioned above, this is the character used to split
# version_locations. Valid values are:
#
# version_path_separator = :
# version_path_separator = ;
# version_path_separator = space
version_path_separator = os  # default: use os.pathsep

# the output encoding used when revision files
# are written from script.py.mako
# output_encoding = utf-8

# sqlalchemy.url = sqlite:////Users/baileythegreen/Software/pyani/scratch/alembic/old_db
sqlalchemy.url = sqlite:///pyani/scratch/alembic/old_db


[post_write_hooks]
# post_write_hooks defines scripts or Python functions that are run
# on newly generated revision scripts.  See the documentation for further
# detail and examples

# format using "black" - use the console_scripts runner, against the "black" entrypoint
# hooks = black
# black.type = console_scripts
# black.entrypoint = black
# black.options = -l 79 REVISION_SCRIPT_FILENAME

# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S

and env.py:

from logging.config import fileConfig

from sqlalchemy import engine_from_config
from sqlalchemy import pool

from alembic import context

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# Interpret the config file for Python logging.
# This line sets up loggers basically.
fileConfig(config.config_file_name)

# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata
target_metadata = None

# other values from the config, defined by the needs of env.py,
# can be acquired:
# my_important_option = config.get_main_option("my_important_option")
# ... etc.


def run_migrations_offline():
    """Run migrations in 'offline' mode.
    This configures the context with just a URL
    and not an Engine, though an Engine is acceptable
    here as well.  By skipping the Engine creation
    we don't even need a DBAPI to be available.
    Calls to context.execute() here emit the given string to the
    script output.
    """
    url = config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online():
    """Run migrations in 'online' mode.
    In this scenario we need to create an Engine
    and associate a connection with the context.
    """
    connectable = engine_from_config(
        config.get_section(config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )

    with connectable.connect() as connection:
        context.configure(
            connection=connection, target_metadata=target_metadata
        )

        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()

baileythegreen avatar Feb 09 '22 17:02 baileythegreen

This looks like they're using an environmental variable, so maybe. If you think that's an acceptable way forward. I still need to do some digging to see where script location stuff is used.

Environmental variables are fair game. os.environ["PYANI_DB_PATH"] could be set pretty straightforwardly post-CLI parsing, I think.

When populating, pathlib's Path should allow identification of the absolute path of whatever the CLI picks up for --dbpath on the current system so that you don't need to worry about relative paths in the sqlalchemy URL.
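A sketch of those three steps together (the relative path is a stand-in for args.dbpath, and PYANI_DBPATH is the suggested, hypothetical variable name):

```python
import os
from pathlib import Path

# stand-in for args.dbpath as parsed from the CLI (hypothetical value)
dbpath = Path(".pyani/pyanidb")

# export the resolved (absolute) path for alembic's env.py to pick up
os.environ["PYANI_DBPATH"] = str(dbpath.resolve())

# an absolute path after the three slashes gives the sqlite:////abs/path form
url = f"sqlite:///{os.environ.get('PYANI_DBPATH')}"
```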

widdowquinn avatar Feb 09 '22 17:02 widdowquinn

Okay, good to know. I'll see if I can get that working. I have been waiting to push stuff because technically right now it won't run; without those values, it's going to throw an error.

baileythegreen avatar Feb 09 '22 17:02 baileythegreen

I'm looking forward to seeing how you manage the tests for this, too.

widdowquinn avatar Feb 09 '22 17:02 widdowquinn

That's what I'll be looking at after I try this stuff with environmental variables. I have ideas for what I'll do with tests.

baileythegreen avatar Feb 09 '22 17:02 baileythegreen

The issue described here has been resolved (and I felt really silly once it was). Skip to here if you want to bypass irrelevant questions/discussion.


I have thus far failed to get the environment variable plan working. I can set the variables, and my syntax for retrieving them is correct; however, I am unable to get their values assigned to the alembic config values, and I don't know why.

The env.py file seems to be a black box; I can't get print() statements to work, which makes it difficult to debug what's happening. If I don't have an alembic.ini file present in the directory, I get an error related to that; if I have one, but comment out the lines setting the two variables I need to modify, I get a different error; and whatever I supply there is not overwritten when I reset it in env.py.

An example of how I'm trying to reset them, where $DATABASE is the value of the --dbpath variable.

dbpath = os.environ.get('DATABASE')
url = "sqlite:////Users/baileythegreen/Software/pyani/" + dbpath
config.set_main_option("sqlalchemy.url", url)

From what I understand, this is the correct API method for what I want to do (eliminating the .ini file), but I have yet to get it to work.

baileythegreen avatar Feb 10 '22 03:02 baileythegreen

It's hard to comment on the first part without seeing the error message or practically taking a look.

In the second part, this line:

url = "sqlite:////Users/baileythegreen/Software/pyani/" + dbpath

looks suspect, to me. However, it's not clear what's in the DATABASE environment variable that gets read into dbpath.

The required format, described in an earlier post, is essentially:

sqlite:////absolute/path/to/file.db

It's not clear to me that your string, concatenated with dbpath, satisfies that. It looks like the hard-coded path to your pyani directory might also cause issues.

Have you tried:

  • using args.dbpath from the CLI, and getting the absolute path with args.dbpath.resolve()
  • setting a suitably-named environment variable to the result of args.dbpath.resolve() (and I'd suggest a PYANI_ prefix so that related environment variables are effectively namespaced)
  • setting url = f"sqlite:///{os.environ.get('PYANI_DBPATH')}"

and what was/is the result?

widdowquinn avatar Feb 10 '22 08:02 widdowquinn

FEATURE REQUEST

alembic does upgrades/downgrades in-place. It would be useful, I think, to avoid borking an existing database by these updates, so we could backup the database before we start. For instance:

pyanidb

could be copied to

pyanidb.YYYY-MM-DD_HH-MM-SS.bak

before the in-place modification starts.
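A minimal sketch of that backup step (the timestamp format follows the proposal above; demonstrated against a throwaway file):

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_db(dbpath: Path) -> Path:
    """Copy the database aside with a timestamped .bak name
    before an in-place migration starts."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    dest = dbpath.with_name(f"{dbpath.name}.{stamp}.bak")
    shutil.copy2(dbpath, dest)
    return dest


# demo against a throwaway file
workdir = Path(tempfile.mkdtemp())
db = workdir / "pyanidb"
db.write_bytes(b"sqlite data")
backup = backup_db(db)
```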

widdowquinn avatar Feb 10 '22 16:02 widdowquinn

To resummarise where this is (will potentially be expanded as I think of other things to add):

(Branch with code found here.)

  • The issues with editing env.py not working have been resolved; the problem was that an incorrect env.py was edited.
  • The pyani versiondb command creates an environment variable that is used to set the value for the database name, overriding the value specified in the alembic.ini file. [^1]
  • An environment variable is also used for the script location, but this is set to alembic, rather than being user-determined. [^2]
  • The --package option creates two __init__.py files; one in the directory specified to alembic (default alembic/), one in the subdirectory alembic/versions/. [^3]
  • A step has been added to create a backup of the database being migrated, as described in the previous comment.
  • Additional options (--name, --config) have been created to allow for specifying an individual database name (in cases of a multidb alembic setup) or a config file not in the default location (./alembic.ini).

[^1]: This prevents the need to predict/dictate the database name.

[^2]: This allows us to provide such a directory containing the necessary migration scripts as part of a pyani install.

[^3]: I have decided not to use this option as it creates an issue for devs, wherein running import alembic in the parent directory of a project that contains an alembic/__init__.py file attempts to import what it sees as a local module named alembic, rather than the actual database version control package.


CLI help information for pyani versiondb

! pyani versiondb -h
usage: pyani versiondb [-h] [-l LOGFILE] [-v] [--debug] [--disable_tqdm] [--version] [--citation]
                       [--dbpath DBPATH]
                       (--upgrade [VERSION] | --downgrade [VERSION] | --dry-run [START:END])
                       [--alembic_exe ALEMBIC_EXE] [-n NAME] [-c FILE]

One of --upgrade, --downgrade, or --dry-run must be specified.

optional arguments:
  -h, --help            show this help message and exit
  -l LOGFILE, --logfile LOGFILE
                        logfile location (default: None)
  -v, --verbose         report verbose progress to log (default: False)
  --debug               report debug messages to log (default: False)
  --disable_tqdm        Turn off tqdm progress bar (default: False)
  --version
  --citation
  --dbpath DBPATH       path to pyani database (default: .pyani/pyanidb)
  --upgrade [VERSION]   update an existing database to a newer schema; if no argument is given,
                        'head' will be used (default: None)
  --downgrade [VERSION]
                        revert an existing database to a older schema; if no argument is given,
                        'base' will be used (default: None)
  --dry-run [START:END]
                        produce the SQL that would be run in migrations, without altering the
                        database; start and end versions must be specified (default: None)
  --alembic_exe ALEMBIC_EXE
                        path to alembic executable (default: alembic)
  -n NAME, --name NAME  used to specify an individual database in a multidb setup (default: None)
  -c FILE, --config FILE
                        used to specify a config file for alembic (default: None)

baileythegreen avatar Feb 23 '22 17:02 baileythegreen

I have chased down the issue underlying something weird I was seeing, which somewhat derailed my writing tests for pyani versiondb.

Weird thing I noticed

If I took a copy of an old database (pre-fastANI changes to the ORM), my code was able to upgrade/downgrade this without issue; but if I took a new one (post-fastANI changes to the ORM), I get the following traceback:

Traceback (most recent call last):
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: error in table comparisons after drop column: no such column: kmersize
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/bin/alembic", line 8, in <module>
    sys.exit(main())
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/config.py", line 588, in main
    CommandLine(prog=prog).main(argv=argv)
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/config.py", line 582, in main
    self.run_cmd(cfg, options)
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/config.py", line 559, in run_cmd
    fn(
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/command.py", line 366, in downgrade
    script.run_env()
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/script/base.py", line 563, in run_env
    util.load_python_file(self.dir, "env.py")
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 92, in load_python_file
    module = load_module_py(module_id, path)
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 108, in load_module_py
    spec.loader.exec_module(module)  # type: ignore
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "alembic/env.py", line 104, in <module>
    run_migrations_online()
  File "alembic/env.py", line 79, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/runtime/environment.py", line 851, in run_migrations
    self.get_context().run_migrations(**kw)
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/runtime/migration.py", line 620, in run_migrations
    step.migration_fn(**kw)
  File "/Users/baileythegreen/Software/pyani/alembic/versions/92f7f6b1626e_add_fastani_columns.py", line 28, in downgrade
    op.drop_column("comparisons", "kmersize")
  File "<string>", line 8, in drop_column
  File "<string>", line 3, in drop_column
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/operations/ops.py", line 2189, in drop_column
    return operations.invoke(op)
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/operations/base.py", line 392, in invoke
    return fn(self, operation)
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/operations/toimpl.py", line 89, in drop_column
    operations.impl.drop_column(
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/ddl/impl.py", line 333, in drop_column
    self._exec(base.DropColumn(table_name, column, schema=schema))
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/ddl/impl.py", line 197, in _exec
    return conn.execute(construct, multiparams)
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1289, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/sql/ddl.py", line 80, in _execute_on_connection
    return connection._execute_ddl(
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1381, in _execute_ddl
    ret = self._execute_context(
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context
    self._handle_dbapi_exception(
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception
    util.raise_(
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) error in table comparisons after drop column: no such column: kmersize
[SQL: ALTER TABLE comparisons DROP COLUMN kmersize]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

This puzzled me, because the column is clearly present in the database.

Upon further inspection, it seems to come down to a difference between the unique constraints specified in the updated ORM and those produced by my migration script when upgrading the old database.

Schema for comparisons table in old database

sqlite> .schema comparisons
CREATE TABLE comparisons (
	comparison_id INTEGER NOT NULL, 
	query_id INTEGER NOT NULL, 
	subject_id INTEGER NOT NULL, 
	aln_length INTEGER, 
	sim_errs INTEGER, 
	identity FLOAT, 
	cov_query FLOAT, 
	cov_subject FLOAT, 
	program VARCHAR, 
	version VARCHAR, 
	fragsize INTEGER, 
	maxmatch BOOLEAN, 
	PRIMARY KEY (comparison_id), 
	UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch), 
	FOREIGN KEY(query_id) REFERENCES genomes (genome_id), 
	FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);

Schema for comparisons table in old database following upgrade

sqlite> .schema comparisons
CREATE TABLE comparisons (
	comparison_id INTEGER NOT NULL, 
	query_id INTEGER NOT NULL, 
	subject_id INTEGER NOT NULL, 
	aln_length INTEGER, 
	sim_errs INTEGER, 
	identity FLOAT, 
	cov_query FLOAT, 
	cov_subject FLOAT, 
	program VARCHAR, 
	version VARCHAR, 
	fragsize INTEGER, 
	maxmatch BOOLEAN, kmersize INTEGER, minmatch FLOAT, nonsense INTEGER, 
	PRIMARY KEY (comparison_id), 
	UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch), 
	FOREIGN KEY(query_id) REFERENCES genomes (genome_id), 
	FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);

Schema for comparisons table in new database

CREATE TABLE comparisons (
	comparison_id INTEGER NOT NULL, 
	query_id INTEGER NOT NULL, 
	subject_id INTEGER NOT NULL, 
	aln_length INTEGER, 
	sim_errs INTEGER, 
	identity FLOAT, 
	cov_query FLOAT, 
	cov_subject FLOAT, 
	program VARCHAR, 
	version VARCHAR, 
	fragsize INTEGER, 
	maxmatch BOOLEAN, 
	kmersize INTEGER, 
	minmatch FLOAT, 
	PRIMARY KEY (comparison_id), 
	UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch, kmersize, minmatch), 
	FOREIGN KEY(query_id) REFERENCES genomes (genome_id), 
	FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);

baileythegreen avatar Mar 01 '22 01:03 baileythegreen
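Since SQLite's `ALTER TABLE` cannot change a table's `UNIQUE` constraint in place, the upgrade has to rebuild the table so the constraint matches the new ORM; this "move and copy" dance is what alembic's `batch_alter_table()` automates. A minimal sketch of the equivalent in plain `sqlite3` (the table is trimmed to the relevant columns for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Old-style table: the UNIQUE constraint lacks kmersize/minmatch, and
# those columns were bolted on by a simple ADD COLUMN (as in the
# upgraded old database above).
cur.execute("""
    CREATE TABLE comparisons (
        comparison_id INTEGER NOT NULL PRIMARY KEY,
        query_id INTEGER NOT NULL,
        subject_id INTEGER NOT NULL,
        program VARCHAR,
        fragsize INTEGER,
        kmersize INTEGER,
        minmatch FLOAT,
        UNIQUE (query_id, subject_id, program, fragsize)
    )
""")
cur.execute("INSERT INTO comparisons VALUES (1, 1, 2, 'fastani', 3000, 16, 0.2)")

# Rebuild: create a table with the desired constraint, copy the rows
# across, drop the old table, and rename the new one into place.
cur.executescript("""
    CREATE TABLE comparisons_new (
        comparison_id INTEGER NOT NULL PRIMARY KEY,
        query_id INTEGER NOT NULL,
        subject_id INTEGER NOT NULL,
        program VARCHAR,
        fragsize INTEGER,
        kmersize INTEGER,
        minmatch FLOAT,
        UNIQUE (query_id, subject_id, program, fragsize, kmersize, minmatch)
    );
    INSERT INTO comparisons_new SELECT * FROM comparisons;
    DROP TABLE comparisons;
    ALTER TABLE comparisons_new RENAME TO comparisons;
""")
conn.commit()

# The rebuilt table now carries the new UNIQUE constraint
(schema,) = cur.execute(
    "SELECT sql FROM sqlite_master WHERE type='table' AND name='comparisons'"
).fetchone()
print("kmersize" in schema.split("UNIQUE")[1])
```

Inside a migration, wrapping the operations in `with op.batch_alter_table("comparisons") as batch_op:` performs this rebuild automatically.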

sigh

Apparently SQLite's `ALTER TABLE` cannot drop a unique constraint (the table has to be rebuilt instead), so this migration just got a lot harder.

I have also been thinking about the utility of the downgrade option, and have come to question it. If someone has a new database containing any fastANI runs, removing the kmersize and minmatch columns could leave some entries non-unique, which is problematic. We could handle this by not supporting downgrades at all; by throwing a warning; or by removing any fastANI entries (and notifying the user).

We're creating a backup, so none of these would be catastrophic, but one may be preferable.
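The rows at risk can be found up front: any two comparisons that differ only in kmersize/minmatch become duplicates under the old constraint. A hedged sketch of such a pre-downgrade check (table trimmed to the relevant columns; the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE comparisons (
        comparison_id INTEGER PRIMARY KEY,
        query_id INTEGER, subject_id INTEGER,
        program VARCHAR, version VARCHAR,
        fragsize INTEGER, maxmatch BOOLEAN,
        kmersize INTEGER, minmatch FLOAT
    )
""")
# Two fastANI runs that differ only in kmersize: unique under the new
# constraint, but they collide once kmersize/minmatch are dropped.
cur.executemany(
    "INSERT INTO comparisons VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 2, "fastani", "1.32", 3000, 0, 16, 0.2),
        (2, 1, 2, "fastani", "1.32", 3000, 0, 21, 0.2),
        (3, 1, 2, "nucmer", "3.1", 1020, 0, None, None),
    ],
)

# Groups of rows that would violate the old UNIQUE constraint
# (query_id, subject_id, program, version, fragsize, maxmatch)
# after the downgrade.
collisions = cur.execute("""
    SELECT query_id, subject_id, program, version, fragsize, maxmatch,
           COUNT(*) AS n
    FROM comparisons
    GROUP BY query_id, subject_id, program, version, fragsize, maxmatch
    HAVING COUNT(*) > 1
""").fetchall()
print(collisions)
```

Running the check before attempting the rebuild would let the downgrade fail fast, or warn, instead of erroring midway through the migration.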

baileythegreen avatar Mar 01 '22 02:03 baileythegreen

The above issue (re: constraints) seems to be solved using the contextual version of the batch_alter_table() code mentioned earlier in this thread (https://github.com/widdowquinn/pyani/issues/378#issuecomment-1032081759). And the issue mentioned in that comment seems to have disappeared? I still don't know what that was about.

tl;dr: I have the database successfully upgrading and downgrading, new and old databases alike.

Now I just need to decide whether warnings are needed for downgrading, et cetera, and to finish my tests.

baileythegreen avatar Mar 01 '22 05:03 baileythegreen

> I have also been thinking about the utility of the downgrade option, and have been led to question it. If someone has a new database that contains any fastANI runs, removal of the kmersize and minmatch columns might mean some entries are non-unique, which is problematic. This could be handled by just not supporting downgrading; or we could throw a warning about this; or remove any fastANI entries (and notify the user).
>
> We're creating a backup, so none of these would be catastrophic, but one may be preferable.

My first thought is that we should aim to preserve the necessary metadata for a run, which for fastANI includes the kmersize/minmatch columns. If we cannot do so, we should flag the problem loudly (with a warning) and remove those runs from the database, on the grounds that fastANI runs cannot be represented in the old schema.

I think we are justified in this, as we back up, and do not destroy, the original, later-format database.
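The "warn loudly and remove" behaviour could be a short step run before the columns are dropped. A minimal sketch, assuming the program column stores the value `fastani` for these runs (that value, and the trimmed table, are illustrative):

```python
import logging
import sqlite3

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("pyani.versiondb")

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE comparisons ("
    "comparison_id INTEGER PRIMARY KEY, program VARCHAR, "
    "kmersize INTEGER, minmatch FLOAT)"
)
cur.executemany(
    "INSERT INTO comparisons VALUES (?, ?, ?, ?)",
    [(1, "fastani", 16, 0.2), (2, "nucmer", None, None)],
)

# Before dropping kmersize/minmatch, delete runs whose metadata the old
# schema cannot represent, and tell the user what is being discarded.
(n_fastani,) = cur.execute(
    "SELECT COUNT(*) FROM comparisons WHERE program = 'fastani'"
).fetchone()
if n_fastani:
    logger.warning(
        "Downgrade will discard %d fastANI comparison(s); "
        "the original database is preserved in the backup.",
        n_fastani,
    )
    cur.execute("DELETE FROM comparisons WHERE program = 'fastani'")
conn.commit()
```

Because the backup is taken first, the deletion is recoverable, which makes the warning-plus-removal option less drastic than it sounds.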

I would see downgrading the database from fastANI support to pre-fastANI support as an unusual operation, but we might expect more up-/downgrading with future use.

widdowquinn avatar Mar 01 '22 16:03 widdowquinn