
[PIP] Schema compatibility check

tyt2y3 opened this issue 2 years ago • 5 comments

No pun intended, PIP stands for "Proposal & Implementation Plan".

Motivation

To allow an application to check that all Entities are compatible with the current database schema, on CI and/or on application startup.

Architecture

We need to introduce a standardized way to define all entities, and I imagine:

trait EntityCollection: Iterable {
    fn entity_schema(&self, db: DbBackend) -> EntitySchema {
        // this will call `create_table_from_entity` etc. to construct EntitySchema
        todo!()
    }
}

struct EntitySchema {
    name: DynIden,
    table: TableCreateStatement,
    enums: Vec<TypeCreateStatement>,
    indices: Vec<IndexCreateStatement>,
}

// user space

#[derive(EnumIter, DeriveEntityCollection)]
enum Entities {
    Cake,
    Fruit,
}

And as such, we can have the following API:

impl Schema {
    async fn check_compatibility<C: EntityCollection>(db: &DbConn, entities: &C) -> Result<(), SchemaErrors> {
        let mut errors = Vec::new();
        // discover the live schema from the database, as codegen does (details elided)
        let db_schema = discover_schema(db).await;
        for entity in C::iter() {
            // collect every mismatch instead of failing on the first one
            errors.extend(check(entity.entity_schema(db.get_database_backend()), &db_schema));
        }
        if errors.is_empty() {
            Ok(())
        } else {
            Err(SchemaErrors { errors })
        }
    }
}

struct SchemaErrors {
    errors: Vec<SchemaError>,
}

enum SchemaError {
    TableMissing(..),
    ColumnMissing(..),
    ColumnTypeIncompatible(..),
    ..
}
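
For illustration, startup usage could then look like the sketch below. The Entities::collection() constructor is hypothetical; how an application obtains the collection value is an open detail of this proposal.

// rough usage sketch against the proposed API above
async fn assert_schema_compatible(db: &DbConn) {
    // Entities::collection() is a hypothetical way to obtain the
    // codegen-generated collection; the exact shape is not settled
    if let Err(SchemaErrors { errors }) = Schema::check_compatibility(db, &Entities::collection()).await {
        for error in &errors {
            // report every mismatch, not just the first one
            eprintln!("schema mismatch: {error:?}");
        }
        panic!("database schema is incompatible with the entities");
    }
}

On CI, the same function could run against a freshly migrated database so that schema drift fails the build.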

Mode of operation

Under the hood, it will discover the entire schema from the database (just like codegen) and compare it against EntitySchema.

The tricky part is that these two schemas will not be exactly the same, so we should be somewhat tolerant of differences.

The live schema can be a superset of the Entity schema, i.e. there can be extra tables and columns. So we should start from EntitySchema and check that 1) each column exists and 2) the types are interoperable (e.g. DateTime can work with time::DateTime or chrono::DateTime; to complicate things a bit, numeric can also work with both Decimal and BigDecimal).
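
As a minimal sketch of such a tolerant comparison (with a simplified stand-in for sea_query's ColumnType; the real rules would also need to compare lengths and precision):

// simplified stand-in for sea_query's ColumnType, just for this sketch
#[derive(Clone, Copy)]
enum ColTy {
    DateTime,
    Timestamp,
    Decimal,
    Integer,
    Text,
}

// equivalence classes for the tolerant comparison
#[derive(PartialEq)]
enum TypeClass {
    DateTimeLike, // satisfied by time::* and chrono::* mappings alike
    NumericLike,  // satisfied by Decimal and BigDecimal alike
    Integer,
    Text,
}

fn classify(ty: ColTy) -> TypeClass {
    match ty {
        ColTy::DateTime | ColTy::Timestamp => TypeClass::DateTimeLike,
        ColTy::Decimal => TypeClass::NumericLike,
        ColTy::Integer => TypeClass::Integer,
        ColTy::Text => TypeClass::Text,
    }
}

// compatible when both sides fall into the same class, even if the
// exact column types differ
fn types_compatible(entity_ty: ColTy, live_ty: ColTy) -> bool {
    classify(entity_ty) == classify(live_ty)
}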

The MVP doesn't have to check absolutely everything; just some sanity checks in CI would be better than nothing.

Since this depends on SeaSchema, we should place this implementation in sea-orm-migration.

And we need the support of the DeriveEntityCollection macro as well as codegen to generate the Entities enum in the first place.

A bonus would be that we can return all errors instead of failing immediately.

tyt2y3 · Oct 16 '22 14:10

Hi! I was recently thinking about a similar thing. For a while now, Django has had a feature: whenever you change your model, you can simply run its CLI django-admin makemigrations, and the tool will:

  1. scan the existing entities (a.k.a. Models),
  2. run through the existing migrations, accumulating the changes in a "virtual schema",
  3. compare the resulting "virtual schema" with what you actually have in your models,
  4. and generate a new migration capable of performing the necessary adjustments.

All this is done without any connection to a database. Django does not run the migrations within a transaction to see what happens. It does not try to compare the actual tables with anything.

Just an example (self-explanatory, I hope). Let's say I have a simple model

class MyModel(models.Model):
    name = models.CharField(max_length=10)

After running makemigrations I get a file called 0001_initial.py with the following code:

class Migration(migrations.Migration):

    initial = True

    dependencies = [
    ]

    operations = [
        migrations.CreateModel(
            name='MyModel',
            fields=[
                ('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('name', models.CharField(max_length=10)),
            ],
        ),
    ]

Now, if I change the model by adding a new field, like so:

class MyModel(models.Model):
    name = models.CharField(max_length=10)
    age = models.IntegerField(default=1)

and launch makemigrations, Django will generate a new file, this time called 0002_mymodel_age.py containing the changes:

class Migration(migrations.Migration):

    dependencies = [
        ('<app name here>', '0001_initial'),
    ]

    operations = [
        migrations.AddField(
            model_name='mymodel',
            name='age',
            field=models.IntegerField(default=1),
        ),
    ]

This is not only an incredibly useful feature, saving time and keeping developers from making mistakes. It can also be used to validate the migrations against the existing entities (models), which is exactly what you described earlier.

I believe SeaORM can perform the same trick with some changes.

Currently all migrations use SchemaManager to make the changes, and SchemaManager runs them directly on a live DB. More specifically, a Migration creates (mostly) DB-agnostic statements (like TableCreateStatement or IndexCreateStatement). Then SchemaManager turns those statements into executable SQL and runs it through a specified DB connection.

Instead, SchemaManager can be made more flexible, capable of "running" against a "virtual database". This "virtual DB" would receive the statements directly, analyze and "execute" them, changing its internal state accordingly. One would not even need to rewrite the existing migrations.
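
A rough sketch of that idea, with toy stand-ins for the real statement types (whose internals are not yet inspectable, as noted below):

use std::collections::{HashMap, HashSet};

// toy stand-ins for sea_query's statements; the real thing would consume
// TableCreateStatement and friends once they expose their configuration
enum Operation {
    CreateTable { name: String, columns: Vec<String> },
    AddColumn { table: String, column: String },
    DropTable { name: String },
}

// the "virtual database": executing an operation only mutates this
// in-memory state, no connection involved
#[derive(Default)]
struct VirtualSchema {
    tables: HashMap<String, HashSet<String>>,
}

impl VirtualSchema {
    fn apply(&mut self, op: &Operation) {
        match op {
            Operation::CreateTable { name, columns } => {
                self.tables.insert(name.clone(), columns.iter().cloned().collect());
            }
            Operation::AddColumn { table, column } => {
                if let Some(cols) = self.tables.get_mut(table) {
                    cols.insert(column.clone());
                }
            }
            Operation::DropTable { name } => {
                self.tables.remove(name);
            }
        }
    }
}

Replaying every existing migration through such a type would reproduce Django's "virtual schema" step, and the result could then be diffed against the entities.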

Statements themselves (I mean things like TableCreateStatement) should provide a public interface for accessing their configuration. Currently one can only conveniently construct them, but not inspect what was constructed. Everything is pub(crate).

Doing that will both make it possible to perform the compatibility check and, later, to automatically generate new migrations according to the changes made to the entities. All this could even be done as a separate library, but the too-low-level SchemaManager and the mostly private XYZStatement structs keep that from happening.

kpot · Nov 29 '22 13:11

Thank you for your input, but it looks like a separate proposal indeed. I am aware of that Django feature, but it seems very complex to implement. The caveat is that if a user has a migration script in raw SQL, the feature couldn't work. And it is always ambiguous when a column is renamed.

tyt2y3 · Nov 30 '22 09:11

Let me add another perspective to this. Currently, using sea-orm to generate entities brings the problem that the entity definitions mutate during development. But the migrations should not mutate, because they should be able to update an existing system, not only create new ones. For example, I can have a table in a first migration, but in the next migration I want to add a reference to another new table, which gets created within that second migration. Now you can create that second migration without a problem, but the first migration would now fail, since it would try to create a reference to a table that does not yet exist.

mohs8421 · Dec 08 '22 09:12

@mohs8421

For example, I can have a table in a first migration, but in the next migration I want to add a reference to another new table, which gets created within that second migration. Now you can create that second migration without a problem, but the first migration would now fail, since it would try to create a reference to a table that does not yet exist.

I believe the first migration cannot fail in this case, since SeaORM maintains a special table called seaql_migrations, which is used to track already-applied migrations. So, in theory, the first migration would not be executed again unless we roll it back (down) first, which in the process deletes it from seaql_migrations.
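
(As a side note, that bookkeeping can be inspected directly; a sketch, assuming the default seaql_migrations layout with a version column:)

use sea_orm::{ConnectionTrait, DbConn, Statement};

// list the migrations SeaORM has recorded as applied; assumes the
// default bookkeeping table with a `version` column
async fn applied_migrations(db: &DbConn) -> Result<Vec<String>, sea_orm::DbErr> {
    let rows = db
        .query_all(Statement::from_string(
            db.get_database_backend(),
            "SELECT version FROM seaql_migrations ORDER BY version".to_owned(),
        ))
        .await?;
    rows.iter().map(|row| row.try_get("", "version")).collect()
}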

But you've raised a good point.

But the migrations should not mutate, because they should be able to update an existing system, not only create new ones.

The migrations should not mutate. This is why I find all the current examples of migrations, both in the docs and in this repo, a bit weird. They look like this:

manager
    .create_table(
        sea_query::Table::create()
            .table(Post::Table)
            .if_not_exists()
            .col(
                ColumnDef::new(Post::Id)
                    .integer()
                    .not_null()
                    .auto_increment()
                    .primary_key()
            )
            .col(ColumnDef::new(Post::Title).string().not_null())
            .col(ColumnDef::new(Post::Text).string().not_null())
            .to_owned()
    )

or

manager
    .drop_table(
        sea_query::Table::drop()
            .table(Post::Table)
            .to_owned()
    )

All such examples rely on using the entity structs directly (as in ColumnDef::new(Post::Title)), which makes them prone to failures later: some changes to the entities will implicitly alter all existing migrations. For instance, what is going to happen to the last example if I change the Post entity's table_name attribute and then run sea-orm migrate down? It would fail, because there is no table with the new name to drop.

I would go so far as to require all column and table names within migrations to be given explicitly as strings, perhaps even as &'static str. Or, at least, make the use of string names convenient, without the need to write something like

manager
    .create_table(
        sea_query::Table::create().table(TableRef::Table(Alias::new("my_table_name").into_iden()))

It would be great to have something like this instead:

manager
    .create_table(
        sea_query::Table::create(("schema", "my_table_name"))
        .if_not_exists()
        .col(
            ColumnDef::new("id")
                .integer()
                .not_null()
                .auto_increment()
                .primary_key()
         )
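
Until then, one way to get this decoupling today is to define the identifiers locally in the migration file instead of importing them from the entity, e.g.:

use sea_orm_migration::prelude::*;

// identifiers owned by the migration itself: later changes to the Post
// entity (e.g. its table_name attribute) cannot silently rewrite what
// this migration does
#[derive(Iden)]
enum Post {
    Table, // renders as "post"
    Id,
    Title,
    Text,
}

The strings these identifiers render to are frozen inside the migration, so renaming the entity's table later forces an explicit new migration instead of mutating old ones.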

kpot · Dec 08 '22 10:12

I believe the first migration cannot fail in this case, since SeaORM maintains a special table called seaql_migrations which is used to track already applied migrations. So, in theory the first migration would not be executed again, unless we roll it back (down) first, which in the process will delete it from the seaql_migrations.

Maybe I didn't make that case clear enough. Yes, the migration will run fine on a system where the first migration was already executed, but on a system where it was not, it will fail.

Currently I help myself with checks in the migration steps that look up whether something already exists and skip the change if so. But this only steps around some of the issues. Another issue is that adding a foreign key constraint referencing a table that does not yet exist is not possible without disabling FK checks, which is why one needs to verify the table's existence before creating the constraint. Adding non-null columns or unique constraints may also lead to failed migrations, unless something is done with the data already in the DB.
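
A sketch of what such a guard can look like using SchemaManager's has_table helper (table, column, and constraint names here are placeholders):

// inside an up() migration: only add the constraint once the
// referenced table actually exists
if manager.has_table("other_table").await? {
    manager
        .create_foreign_key(
            ForeignKey::create()
                .name("fk_post_other")
                .from(Alias::new("post"), Alias::new("other_id"))
                .to(Alias::new("other_table"), Alias::new("id"))
                .to_owned(),
        )
        .await?;
}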

mohs8421 · Dec 08 '22 14:12