rusqlite_migration

Store migration hashes to detect accidental modifications

Open · czocher opened this issue 2 years ago · 1 comment

Hello @cljoly, Most database migration libraries store a hash sum of each migration in the database in order to detect migration modifications. In my experience this feature often proves very useful, because accidental changes made by distracted programmers tend to slip through review.

In our implementation, since we only store the 32-bit migration id in the user_version PRAGMA field, we can't replicate this mechanism with the same storage technique.

My idea for a potential backwards-compatible solution is to create our own migrations table, containing just two columns:

  • id - the migration id,
  • hashsum - a SHA-256 hash of the migration content (up + down)

This feature can be additive and feature-flagged, since not every user may require it (although it would be strongly advised for the extra consistency benefits).

This table would be used only to compare the hashes, not to verify the current migration - that would still be done with the user_version field. The hash comparison could be either implicit (done whenever a migration is requested) or explicit (performed with a new method).
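A minimal sketch of what this could look like, assuming the rusqlite and sha2 crates; the _migration_hashes table name and the function names are illustrative, not part of the library:

```rust
use rusqlite::Connection;
use sha2::{Digest, Sha256};

/// Hash the concatenated up and down SQL of one migration.
fn hash_migration(up: &str, down: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(up.as_bytes());
    hasher.update(down.as_bytes());
    // Hex-encode the 32-byte digest for storage in a TEXT column.
    hasher
        .finalize()
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}

/// Record the hash on first application; on later runs, compare it.
/// Returns false if an already-applied migration was modified.
fn check_or_record(conn: &Connection, id: u32, up: &str, down: &str) -> rusqlite::Result<bool> {
    conn.execute(
        "CREATE TABLE IF NOT EXISTS _migration_hashes (
             id INTEGER PRIMARY KEY,
             hashsum TEXT NOT NULL
         )",
        [],
    )?;
    let hash = hash_migration(up, down);
    // Only inserts on first application; later runs keep the stored hash.
    conn.execute(
        "INSERT OR IGNORE INTO _migration_hashes (id, hashsum) VALUES (?1, ?2)",
        rusqlite::params![id, hash],
    )?;
    let stored: String = conn.query_row(
        "SELECT hashsum FROM _migration_hashes WHERE id = ?1",
        [id],
        |row| row.get(0),
    )?;
    Ok(stored == hash)
}
```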

Other solutions I thought about were:

  • Storing the hashes not in the database, but in a separate file instead. If someone removes that file, we just regenerate it and assume the migrations weren't modified.
  • Storing only the hash of hashes to verify if any migration changed without being able to tell which one.
  • Using 16 bits of the user_version for the migration id and the other 16 bits for a combined hash of the hashes of all the migrations (using e.g. Fletcher-16; see the sketch after this list). This would allow us to find accidental modifications, but not to point to the migration that was modified. It would still allow us to count 2^16 - 2 migrations. I rejected this idea due to the risk of collisions on 16 bits, which would limit the usefulness of this feature.
  • Using bloom filters instead of a migration counter. This of course would be a breaking change. I rejected this idea because the risk of collision on a 32 bit bloom filter grows very quickly. After just around 10 migrations there would be a 1 in 5 chance of a false positive.
  • Using some kind of 16-bit CRC that could detect a single change - error detection at the hash level rather than the bit level?
  • Using the hashes as parameters of a polynomial and finding a zero point? This may not work at all, but it would be conceptually interesting.
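As a side check on the Bloom-filter estimate above: with m = 32 bits, k = 2 hash functions and n = 10 insertions, the false-positive rate is roughly (1 - e^(-kn/m))^k = (1 - e^(-20/32))^2 ≈ 0.22, which matches the "1 in 5" figure. And for the bit-packing variant, here is a rough illustration (function names are hypothetical) of how a 16-bit id and a Fletcher-16 checksum over the concatenated migration hashes could share SQLite's signed 32-bit user_version:

```rust
/// Fletcher-16 over a byte slice (simple, unoptimized form).
fn fletcher16(data: &[u8]) -> u16 {
    let (mut sum1, mut sum2) = (0u16, 0u16);
    for &byte in data {
        sum1 = (sum1 + u16::from(byte)) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    (sum2 << 8) | sum1
}

/// Pack a 16-bit migration id and a 16-bit checksum of all migration
/// hashes into the signed 32-bit user_version value.
fn pack_user_version(migration_id: u16, concatenated_hashes: &[u8]) -> i32 {
    let checksum = fletcher16(concatenated_hashes);
    ((u32::from(migration_id) << 16) | u32::from(checksum)) as i32
}
```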

czocher · Jun 20 '23 04:06

Sorry, I thought I answered, but apparently I didn't.

In my mind, all use cases of this library involve an application (a CLI or GUI tool, a server...) calling the SQLite library to apply migrations and read/write some data. If the tool is called repeatedly (like git), startup time is critical and needs to be as low as possible. A benefit of this library is that you just read a number at a fixed offset in the SQLite file and you know the state of the migrations. Parsing additional tables is needlessly slow in that context.
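For comparison, the entire check today amounts to a single pragma read. A minimal sketch with rusqlite (the function name is illustrative):

```rust
use rusqlite::Connection;

/// The schema version lives in the user_version pragma, a 4-byte field
/// in the SQLite file header, so reading it is essentially free.
fn current_migration_id(conn: &Connection) -> rusqlite::Result<i32> {
    conn.query_row("PRAGMA user_version", [], |row| row.get(0))
}
```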

This library also tries to be easy to understand and to stay out of the way. Of course, we could add another optional feature, but each added feature makes it harder to grasp what the library does and how to use it. In a way, that limits the total number of features we could potentially have, and I would rather keep that finite budget for more meaningful things.

If you really need to triple-check that migrations were not changed, it seems to me that you would pick another library in the first place, for instance because you also need generic support for Postgres and SQLite.

As a side note, I believe that it is useful to be able to change a migration's text while keeping its meaning the same. For instance, one could make formatting more consistent across multiple migrations, or change a comment to explain the context around some SQL code after it has been applied.

Finally, I think that this feature is more useful with databases using a server, where someone (or a CI/CD script) will apply the migration, potentially separately from running the actual application. Then there is the potential for that separate process to apply a migration before it gets changed, at which point it would need to be applied again. For us, migrations are typically applied when you run a new version of the application/server for the first time, so migration happens much closer to the actual use of the data.

Thanks for the suggestion and for describing all those various options (I would add one more: have a unit test check the migrations, as sketched below, though an integration with insta would be more valuable, see #112). However, my opinion for now is that this is not a feature I would like to see implemented, even optionally, due to the added complexity and the limited usefulness.
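A sketch of what such a unit test could look like, assuming migrations are plain SQL strings compiled into the binary and the sha2 crate is available; the migration list and the expected digest below are placeholders, not real values:

```rust
#[cfg(test)]
mod tests {
    use sha2::{Digest, Sha256};

    // Placeholder: in a real project this would reference the actual
    // migration strings.
    const MIGRATIONS: &[&str] = &["CREATE TABLE friend (name TEXT);"];

    #[test]
    fn migrations_are_unchanged() {
        let mut hasher = Sha256::new();
        for sql in MIGRATIONS {
            hasher.update(sql.as_bytes());
        }
        let hex: String = hasher
            .finalize()
            .iter()
            .map(|b| format!("{b:02x}"))
            .collect();
        // Placeholder digest: update it deliberately whenever a
        // migration is intentionally edited.
        assert_eq!(hex, "<expected sha256 hex digest>");
    }
}
```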

  • [ ] It may be worth calling out in the documentation more explicitly that we are not checking for changes made to the schema.

cljoly · Dec 05 '23 23:12