
feature request: execute sql in files before running the main yamltodb code

Open xclayl opened this issue 6 years ago • 23 comments

Some database changes yamltodb will never be able to handle. Take renaming a table: yamltodb, at best, will see it as a DROP TABLE and a CREATE TABLE, which isn't correct.

Currently I've written some code to go through a folder and run all the SQL files. For example this one: 20181023 rename groups.sql:

DO $$ 
    BEGIN
        ALTER TABLE IF EXISTS app.group RENAME TO groups;
    END;
$$;

It can run multiple times, so you keep the file around for months until it's deployed to all the environments. You'd think production would be last but, from experience, some testing environments are usually the last to get it.

The files run in filename order, which is why it starts with a date.

It'd be great if Pyrseas handled this natively.

We've been doing this for 10 years with SQL Server with a tool similar to yours, so I'm happy to talk about the ins and outs.

I think the best thing to say about this approach is that it is predictable for users of yamltodb.

xclayl avatar Oct 24 '18 20:10 xclayl

This is a known issue (see https://pyrseas.readthedocs.io/en/latest/issues.html?highlight=oldname#object-renaming), although you may not be aware that there is YAML syntax that can solve this issue, in part. The problem is that the solution doesn't lend itself to storing the YAML specs in a VCS. It also requires manual intervention, e.g., the DBA/developer has to edit an existing YAML file, changing group to groups and adding oldname: group under it. However, that's not unlike other migration tools (like Sqitch) which require that you code the SQL ALTER TABLE like you have above, plus you have to add another ALTER TABLE to revert your change. Two years ago I suggested (https://pyrseas.wordpress.com/2016/06/09/the-future-of-pyrseas-part-1/) that at some point, I'd like to add some "specialized scripts or configuration files" to handle situations such as this. However, I've since put the overall project on sort of "life support" (see the later blog post), so although I appreciate your interest I can't see this getting done (unless someone comes up with a reasonable design).
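
(For illustration, a rough sketch of what that oldname annotation might look like for the group → groups rename above; the column details here are hypothetical, and the exact layout is whatever dbtoyaml produces for your schema:)

schema app:
  table groups:
    oldname: group
    columns:
    - id:
        type: integer
        not_null: true
    - name:
        type: text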

jmafc avatar Oct 24 '18 22:10 jmafc

Admittedly there have been a few times where a roll-back script would have been handy. In the early days we weren't testing enough, and rolled back. Lately it's happened only once, related to the migration script not having enough permission in production (TCP ports -- something rare for SQL Server), compounded by people being on holiday. I'd hate to use it as the example of why you need rollback scripts. If you test enough, rollback isn't something you need.

I take the point that you end up having to write the change twice -- once as a migration script and once as the final version (in yaml).

But let me point out what it does give you.

It separates out

  • migration code
  • and the final schema.

After a few months, we delete the migration code and all we are left with is the yaml which describes only the schema as it stands today. Deletion is easy, b/c the files are dated and located in one folder (or two folders, as some scripts are executed after yamltodb -- but that folder usually contains static data).

With a 7+ year old project, the database folder only contains the current structure. It's a very clean/organized folder. And there have been loads of DB migrations in that time.

During that time, there have been countless source control branches and merges, and it's easy to handle. You don't have to worry about a global change script order. Dates in the filenames are good enough.

Some migration changes yamltodb won't be able to handle generically:

  • Move columns to a new child table (initially with one child per parent) -- see the sketch after this list
  • Convert a bit column to a FK column to another table containing 3 rows (and you want to specify the PK column values for sorting reasons)
  • Split out a char column into two columns
  • Move some tables into a JSON field
  • and many more
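
As an illustration of the first item, a hedged sketch of what such a pre-change script could look like (the table and column names are made up), written so it can re-run safely:

DO $$
    BEGIN
        -- only act if the columns haven't been moved yet
        IF EXISTS (SELECT 1 FROM information_schema.columns
                    WHERE table_schema = 'app' AND table_name = 'person'
                      AND column_name = 'phone') THEN

            -- new child table, initially one row per parent
            CREATE TABLE IF NOT EXISTS app.person_details (
                person_id integer PRIMARY KEY REFERENCES app.person (id),
                phone     text,
                address   text
            );

            -- move the data, then drop the old columns
            INSERT INTO app.person_details (person_id, phone, address)
            SELECT id, phone, address FROM app.person;

            ALTER TABLE app.person DROP COLUMN phone;
            ALTER TABLE app.person DROP COLUMN address;
        END IF;
    END;
$$;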

You probably could handle each of these one-by-one, but

  • Would they collectively handle all scenarios?
  • Would your users know how to do these transformations?
  • Would your users want to learn the special yamltodb syntax when they already know the SQL to do it?
  • Would the migration parameters (or script) live in source control forever?

Ideally, let yamltodb handle the predictable changes (adding a nullable column, changing a CHECK constraint) and let a folder of "pre" change scripts get the database "close enough" for yamltodb to take it to the exact schema you want.

Try it out for one of your projects -- executing every *.sql file in a folder (in filename order) isn't very complex code to write. Once you do it for a while, you'll see it solves lots of problems.

xclayl avatar Oct 25 '18 02:10 xclayl

That is more or less the concept of Sqitch and similar also to depesz Versioning (no longer maintained apparently). Sqitch, IIRC, uses a sequencing method a bit more sophisticated than dates (a control file?) and also depends on git. My intent (two years ago) was not to add all kinds of transformations to yamltodb but rather to provide some hooks whereby a stream of SQL could be merged with the output from yamltodb. Thanks to @dvarrazzo we now have a depends_on attribute for most DB objects, so in theory this could be extended to external scripts, e.g., depends_on: external:pre:some_script.sql to apply some_script.sql before any YAML-generated changes to a table. But I suspect that has the same or a similar problem to the oldname renaming method: once you apply a rename or an external script, the YAML file has to be edited to remove the oldname or the depends_on. And this is where the technique developed for dbaugment may come in handy: these extra attributes have to be in separate YAML files that are manually edited and version-controlled. Anyway, enough design-on-the-fly or pipe-dreaming for a night :-) Thanks for your feedback.
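
(To make the pipe-dream slightly more concrete: such an annotation might look something like this in a table map; the external:pre: notation is entirely hypothetical and not implemented:)

schema app:
  table groups:
    depends_on:
    - external:pre:20181023_rename_groups.sql
    columns:
    - id:
        type: integer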

jmafc avatar Oct 25 '18 03:10 jmafc

OK, getting to your actual request, what could be done is something like the following: If processing multiple files (although later this could be extended to the single-file case), look in the metadata directory (name is already configurable), for some other directory (pre-sql?) and execute any files therein, in some standard collation order. A question that comes to mind is whether to run the SQL in those files within the same transaction as the main yamltodb SQL stream (if -1 or -u were invoked). IIRC when we process COPY statements for loading "static" tables, we do incorporate them in the main stream. @dvarrazzo @rhunwicks If you have the time, your feedback would be appreciated.
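
A minimal sketch of "execute any files therein, in some standard collation order", assuming psycopg2 and a pre-sql directory, with one transaction per file (all names and placement are illustrative, not the final design):

import glob
import os

import psycopg2


def run_pre_sql(dsn, folder="pre-sql"):
    """Execute every *.sql file in `folder`, in filename order."""
    conn = psycopg2.connect(dsn)
    try:
        for path in sorted(glob.glob(os.path.join(folder, "*.sql"))):
            with open(path) as f:
                sql = f.read()
            with conn:                      # one transaction per file
                with conn.cursor() as cur:
                    cur.execute(sql)
            print("applied", os.path.basename(path))
    finally:
        conn.close()


if __name__ == "__main__":
    run_pre_sql("dbname=mydb user=postgres")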

jmafc avatar Oct 25 '18 12:10 jmafc

Static data sounds great -- Can I put these in CSV files? I don't see how to do this in the docs.

same transaction: For production, it depends on the size. If it's small, then yes. If it's big, then no. It can take a long time to process, and locks will be held on the tables, locking out real users. I could see this getting into deadlocks too. If it's huge, then probably yes (I'm not really sure). You'll add columns, etc, in a way that is fast, like https://brandur.org/postgres-default. So you'll be careful about the SQL you write.

I'd default to a single transaction for everything and have a flag to customize it if you care about this.

In development, I've been doing a single transaction per *.sql file for a while now (years). I'd do this at a minimum. It's nice that it rolls back to a known state if there is an error (it's much harder to fix your database if it is partially migrated, especially if it's not your script). When there is an error, you open pgAdmin and start executing parts of the script to see what's going on.
In development, you'll use backup/restore (or DB templates) if the migration is complicated. You'll run and re-run the migration many times. This could impact whether you want one large transaction, b/c it may not be useful if you're already doing DB templates.

xclayl avatar Oct 25 '18 16:10 xclayl

Re: static data. This is discussed in the Overview under Supplementary Actions which also links to the Configuration Items page where it's discussed under Datacopy. And yes, CSV is the output and expected input format.

jmafc avatar Oct 25 '18 18:10 jmafc

One thing to add about the sql-pre folder: ideally it only runs if the database has tables. This is useful when setting up a test database that is empty initially -- there is no data to migrate. I've been using this query, but you may know a better way in Postgres.

SELECT COUNT(*) as cnt
  FROM information_schema.tables
 WHERE table_schema <>'information_schema' AND table_schema <>'pg_catalog'
   AND table_type='BASE TABLE';

xclayl avatar Oct 28 '18 11:10 xclayl

A Postgres-specific query would be:

SELECT COUNT(*) AS cnt
FROM pg_class c JOIN pg_namespace n ON relnamespace = n.oid
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
      AND relkind = 'r';

However, the query may return zero, but the database may not be empty. There could be views, foreign tables or other objects (functions, domains, types, etc.).

I have a question regarding the scripts (I'll be expecting them in a preactions subdirectory of the metadata directory--both being configurable). Consider a very simple scenario where you have a table named foo in the database and one of the scripts renames it to bar. How do you prevent a second run trying the rename again (and of course, failing)?

IIRC the Versioning approach suggested by depesz creates a schema and a table to hold an identifier (like the name of the script file) for the last applied change, so it will not attempt to redo a change. I believe Sqitch uses a similar or perhaps more sophisticated approach because it's also tied in to a Git repo. So how do you suggest that Pyrseas prevent a second run? Or would you just always expect an error on subsequent runs, which could be safely ignored? I don't think the latter is a very good approach although I can't think of a specific bad scenario except one involving a dumb DBA loading data into a table without a primary key.
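
(For reference, the tracking-table approach mentioned above boils down to something like this; the schema and table names are illustrative, not the actual Versioning or Sqitch layout:)

CREATE SCHEMA IF NOT EXISTS migrations;

CREATE TABLE IF NOT EXISTS migrations.applied (
    script_name text PRIMARY KEY,
    applied_at  timestamptz NOT NULL DEFAULT now()
);

-- the runner skips any file already recorded here, and records
-- each script after it completes successfully:
INSERT INTO migrations.applied (script_name)
VALUES ('20181023 rename groups.sql')
ON CONFLICT (script_name) DO NOTHING;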

jmafc avatar Nov 02 '18 04:11 jmafc

The query to look for an empty database is alright. If there are no tables (only views, etc.), there's nothing to migrate, so the pre-sql scripts don't need to run.

The scripts need to be able to re-run, but on later executions nothing happens. This is actually much easier in Postgres compared to SQL Server.

Here's a sample script that can re-run perfectly fine:

DO $$ 
    BEGIN
        ALTER TABLE app.group ADD COLUMN IF NOT EXISTS title character varying(10000) NULL;
    END;
$$;

DO $$ 
    BEGIN
        UPDATE app.group
        SET title = 'unnamed'
        WHERE title IS NULL;
    END;
$$;


DO $$ 
    BEGIN
        ALTER TABLE app.group ALTER COLUMN title SET NOT NULL;
    END;
$$;

No need to worry about tracking if the file ran or not.

xclayl avatar Nov 03 '18 00:11 xclayl

The technical term being that the scripts need to be idempotent, so that's something that will have to be emphasized in the user documentation. BTW, you can do that all in one shot, i.e., ALTER TABLE app.group ADD COLUMN IF NOT EXISTS title character varying(10000) DEFAULT 'unnamed' NOT NULL;

jmafc avatar Nov 03 '18 00:11 jmafc

I've tried to devote some of the scant time that I have for Pyrseas nowadays to this, and it's like swimming upstream. Part of the problem is lack of interest, but also the problem definition is still very amorphous. To add support for a new PG statement like CREATE PROCEDURE, there's concrete documented syntax and catalog changes which drive the process, making it easier to use a TDD approach. Even for the datacopy feature, I knew we had to support COPY FROM and COPY TO and the main task became how to invoke \copy so that a regular user could use the feature.

With this request, I know we'll have various files in a subdirectory of the metadata directory, to be named preactions but the name will be configurable. I also know the files should presumably contain SQL statements, but other than the example of ALTER TABLE t1 RENAME TO t2 (and possibly also a column rename), I have no other test cases. I can of course add bad SQL as a test case, but that's just a test to verify some kind of exception is raised.

The code changes will probably go into the following function: https://github.com/perseas/Pyrseas/blob/e082b0ad4aad8fdafb504bb20e9c0d95a2d3c92b/pyrseas/database.py#L399 which only gets called when the --multiple-files option is used, or it would have to be called before map_from_dir is called. It's also unclear whether the stream of SQL statements will be incorporated into the output stream of yamltodb or whether the files should be invoked separately and, if so, whether to execute them via Psycopg or to spawn a psql process.

Your feedback would hopefully help to clarify some of these questions.

jmafc avatar Nov 08 '18 05:11 jmafc

One scenario is to split a table into two: parent and child. Columns A & B stay with the parent and columns C & D move to a new child table.
Another (harder) example is to combine two tables into one by putting the child data in a json field of the parent.
I still like my example of adding a not null column with a value (like 'migrated'). I want new records to specify the value rather than be populated with a default 'migrated' value. Alternatively, this new column could be a record type column where the value is 'legacy' for existing records and the default value going forward is 'new'.
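
A sketch of that last variation as an idempotent script (table and column names made up): existing rows are backfilled with 'legacy', while new rows default to 'new':

DO $$
    BEGIN
        -- add without a default so existing rows start out NULL
        ALTER TABLE app.orders ADD COLUMN IF NOT EXISTS record_type text;

        -- existing rows are the legacy ones
        UPDATE app.orders
           SET record_type = 'legacy'
         WHERE record_type IS NULL;

        -- new rows get 'new' going forward
        ALTER TABLE app.orders ALTER COLUMN record_type SET DEFAULT 'new';
        ALTER TABLE app.orders ALTER COLUMN record_type SET NOT NULL;
    END;
$$;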

I'd go through psycopg so that the single-transaction flag includes these files.

If you allow multiple transactions, I'd have each file be in a transaction rather than no transactions.

Output should indicate where the transactions begin/end (like it does now). I'd just output the filename, not the file contents. These run every time, so it's too verbose to output the contents. Output the filename beforehand so that we know which file failed. Hopefully the error contains the line number to know which statement failed.

Also, only run the preactions if there are tables already in the DB.

Postactions would be nice too. I know of concrete examples in SQL Server however I don't know postgres well enough to have concrete examples.

Thanks for adding this.

xclayl avatar Nov 08 '18 07:11 xclayl

In terms of design, one way to make this consistent with single-file mode would be to have preactions/postactions nodes like:

schema myschema:
  owner: postgres
  ...
postactions 20181108_system_user: |2
  DO
  $body$
  BEGIN
    IF NOT EXISTS (SELECT 1 FROM myschema.person WHERE person_id = 0) THEN
      INSERT INTO myschema.person(person_id, first_name, last_name)
        VALUES (0, 'sys', 'user');
    END IF;
  END;
  $body$
  LANGUAGE 'plpgsql';

This may also be a concrete postactions example.

xclayl avatar Nov 08 '18 07:11 xclayl

If we execute everything via Psycopg, there will be some limitations on what can be written in the SQL scripts (psql for example supports variables and DO statements--I don't know if Psycopg supports DO), but let's go with this simpler scenario for now.

Here is a test implementing your suggested first case:

from pyrseas.testutils import InputMapToSqlTestCase

CREATE_STMT = "CREATE TABLE sd.t1 (c1 integer, c2 text, c3 integer, c4 date)"

class ExecuteActionsToSqlTestCase(InputMapToSqlTestCase):
    """Test execution of externally-specified SQL statements"""

    def test_split_table(self):
        "Split a four-column table into two one column tables"
        inmap = self.std_map()
        inmap['schema sd'].update({
            'table t1': {'columns': [{'c1': {'type': 'integer'}},
                                     {'c2': {'type': 'text'}}]},
            'table t2': {'columns': [{'c3': {'type': 'integer'}},
                                     {'c4': {'type': 'date'}}]}})
        # create SQL-actions file(s)
        sql = self.to_sql(inmap, [CREATE_STMT])
        assert sql == some_stmts

InputMapToSqlTestCase is a helper class that simulates the yamltodb interface. The to_sql method takes a YAML map (in this case, a Python dict) and a list of statements (CREATE_STMT). The latter is executed before the yamltodb phase, i.e., it sets up tables, etc., needed for the test (the helper class first forces the test database to be "empty"; the default sd schema is created as well). The std_map method returns an equally "empty" dict.

The some_stmts is a placeholder for what we expect yamltodb to produce. In this case, this could be something like the following:

CREATE TABLE IF NOT EXISTS t2 (c3 integer, c4 date);
ALTER TABLE t1 DROP COLUMN c3;
ALTER TABLE t1 DROP COLUMN c4;

Of course, we could add a more realistic TRUNCATE t2; INSERT INTO t2 SELECT c3, c4 FROM t1; if desired. Anyway, the SQL text has to be added to a temp file that InputMapToSqlTestCase will cause to be read as if it were yamltodb. An alternative would be to specify the statements as a config file, as we do in test_static.py. BTW, I was wrong in assuming the feature would be limited to --multiple-files invocations only. We handle static data copying even for single-file invocations. It's just that the files have to be placed somewhere and the default location is under the metadata directory.

Once we agree on the process for this simple test and I add the code so that it succeeds, we can discuss more complicated scenarios, including reading multiple scripts in correct order, checking for empty database, etc.

jmafc avatar Nov 09 '18 05:11 jmafc

I'd say DO statements are required -- it'd be useless without them. You need variables, control flow, etc. to do anything slightly complicated. I'm optimistic that psycopg supports DO, b/c it supports cursors, which are also a part of plpgsql.

For a more realistic example, c1 would be the primary key of the parent table, and you'll copy c1 to t2 as well. t2 will have its own PK too. The INSERT INTO would be in the custom preaction script(s) so I wouldn't expect pyrseas to produce TRUNCATE t2; INSERT INTO t2 SELECT c3, c4 FROM t1;.

I don't know python well enough to talk about config files. I guess the only thing I'd say is, as a user, I shouldn't have to use a config file (ie it's optional). The existence of the folder and files is enough.

correct order: probably python string sorting. Consistency is the most important. Ideally it would be the same as the filesystem, but I think linux and windows differ in the way they sort filenames (i.e. file.txt & file2.txt, which is first?). empty database: checking for tables is enough. Feel free to be more thorough. If the output explains what happened, it's fine, e.g. "no tables found, skipping preactions". Or "empty database found, skipping preactions". I can't imagine a real database without tables (I've never personally seen this), so to me checking tables and checking for any DB objects are equivalent.
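
(As an aside on that question: plain Python string sorting compares code points, so the order is deterministic across platforms, regardless of how a file manager displays names. For example:)

>>> sorted(["file.txt", "file2.txt", "File10.txt"])
['File10.txt', 'file.txt', 'file2.txt']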

xclayl avatar Nov 09 '18 08:11 xclayl

The CREATE TABLE, ALTERs, TRUNCATE and INSERT are indeed intended to represent what the user would put in the pre-action files. This is the very first simple example, so we don't have to make it very realistic and that's why I didn't add PKs, etc. As it stands right now, the to_sql call produces the following:

['CREATE TABLE sd.t2 (\n    c3 integer,\n    c4 date)',
 'ALTER TABLE sd.t1 DROP COLUMN c3',
 'ALTER TABLE sd.t1 DROP COLUMN c4']

But the idea is that similar looking statements, in theory taken from a file (but in practice the test itself creates that file), will be read and executed prior to invoking Database.diff_map (the main function of yamltodb), and here's the crucial principle: they have to be executed in the real target database, because otherwise diff_map will spit out overlapping statements. IOW diff_map cannot parse and apply the pre-action statements to the internal representation of the target that it gets from from_catalog; it can only diff that internal representation against the input map (inmap) provided to it.

jmafc avatar Nov 09 '18 14:11 jmafc

Right I see. I agree. Preactions should be executed on the target DB and before from_catalog is called.
Sounds like it could have an impact on the code that begins the transaction.

xclayl avatar Nov 09 '18 19:11 xclayl

For further research: one possibility for dealing with "never able to handle" changes (at least some of them) may be something analogous to dbaugment, that is, something like a separate YAML specification file that can be merged with the regular YAML file. For example, suppose we have a table named 'movies' in the regular YAML. Currently, you can manually edit the file, replace 'movies' by 'film', adding an oldname: movies under it, and yamltodb will generate an ALTER TABLE movies RENAME TO film, but you can't save the manually edited file under a VCS, or re-use it against the same database, because then it will re-generate that statement each time. If the annotations needed to generate the statement were in a separate YAML file that would be merged with the regular file, this problem could be avoided. This separate file could also contain pre-action and post-action sections either at the table/object level and/or at the schema/database level.

jmafc avatar Sep 11 '20 00:09 jmafc

Sounds more complicated than running SQL in filename order. Just my 2 cents.

xclayl avatar Sep 20 '20 19:09 xclayl

The above was mostly a note to self, in case I ever decide to pick up development of Pyrseas (or rather its successor in another language). The advantage of a secondary YAML file is that the changes would not have to be expressed in SQL DDL.

jmafc avatar Sep 20 '20 22:09 jmafc

Which language would you be looking to use? I don't know Python, so it was a blocker to submitting pull requests. SQL is what your users know (and want to learn/invest in), so it's surprising to hear this. In fact (taking it a step further), someone like me would rather write CREATE TABLE statements than yaml. It's what I know, and the SQL knowledge can be used on other projects, even other DBs. Granted, if I were writing Pyrseas, I'd find that difficult to parse/code.

xclayl avatar Sep 20 '20 22:09 xclayl

Pyrseas did not come about because I didn't like to write SQL. It was because if I had a C struct named Movie, renamed it Film and added another struct member release_date, I could merge the change to another branch in practically any VCS without any trouble (barring conflicting changes from somebody else). If I did the same "edits" on a copy of pg_dump output, it was useless: I could perhaps store it in a VCS, and diff version 1 against version 42, but neither Postgres nor any other relational database could take my edited SQL and modify a target database to agree with it. There was a project named apgdiff (actually it's still on GH) that compared two pg_dump outputs and generated SQL DDL changes. But that involves parsing SQL (using Java), which has a number of problems I won't get into. Then there was PostFacto (written in Python) that connected to two live PG databases, queried the catalogs and generated DDL from those. And finally, there was Andromeda (written in PHP) which had the developer/DBA maintain/edit a YAML file describing a PG database (only tables/columns and maybe functions), then generated SQL DDL. I liked the latter approach best, but rather than having someone edit a YAML file by hand, I wrote dbtoyaml to create it for them. As for an implementation language for a Pyrseas successor, if it ever happens, it could be modern C++.

jmafc avatar Sep 20 '20 23:09 jmafc

It’s coming back to me now. I had forgotten the idea is to use dbtoyaml rather than edit the yaml directly. I don’t know C++ either. Probably for the best. I have the impression that we have incompatible visions for this.

xclayl avatar Sep 20 '20 23:09 xclayl