sqlite-utils icon indicating copy to clipboard operation
sqlite-utils copied to clipboard

Ability to merge databases and tables

Open sgraaf opened this issue 3 years ago • 7 comments

Hi! Let me firstly say that I am a big fan of your work -- I follow your tweets and blog posts with great interest 😄.

Now onto the matter at hand: I think it would be great if sqlite-utils included a merge or combine command, with the purpose of combining different SQLite databases into a single SQLite database. This way, the newly "merged" database would contain all differently named tables contained in the databases to be merged as-is, as well a concatenation of all tables of the same name.

This could look something like this:

sqlite-utils merge cats.db dogs.db > animals.db

I imagine this is rather straightforward if all databases involved in the merge contain differently named tables (i.e. no chance of conflicts), but things get slightly more complicated if two or more of the databases to be merged contain tables with the same name. Not only do you have to "do something" with the primary key(s), but these tables could also simply have different schemas (and therefore be incompatible for concatenation to begin with).

Anyhow, I would love your thoughts on this, and, if you are open to it, work together on the design and implementation!

sgraaf avatar Sep 23 '22 11:09 sgraaf

Instead of outputting binary data to stdout the interface might be better like this

sqlite-utils merge animals.db cats.db dogs.db

similar to zip, ogr2ogr, etc

Actually I think this might already be possible within ogr2ogr. I don't believe spatial data is a requirement though it might add an ogc_id column or something

cp cats.db animals.db
ogr2ogr -append animals.db dogs.db
ogr2ogr -append animals.db another.db

chapmanjacobd avatar Sep 24 '22 04:09 chapmanjacobd

This is a really interesting idea.

I'm nervous about needing to set the rules for how duplicate tables should be merged though. This feels like a complex topic - one where there isn't necessarily an obviously "correct" way of doing it, but where different problems that people are solving might need different merging approaches.

Likewise, merging isn't just a database-to-database thing at that point - I could see a need for merging two tables using similar rules to those used for merging two databases.

So I think I'd want to have some good concrete use-cases in mind before trying to design how something like this should work. Will leave this thread open for people to drop those in!

simonw avatar Sep 26 '22 18:09 simonw

This is also the kind of feature that would need to express itself in both the Python library and the CLI utility.

simonw avatar Sep 26 '22 18:09 simonw

This might be fairly straightforward using SQLite's backup utility: https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.backup

eyeseast avatar Sep 26 '22 19:09 eyeseast

That feels like a slightly different command to me - maybe sqlite-utils backup data.db data-backup.db? It doesn't have any of the mechanics for merging tables together. Could be a useful feature separately though.

simonw avatar Sep 26 '22 22:09 simonw

Right. The backup command will copy tables completely, but in the case of conflicting table names, the destination gets overwritten silently. That might not be what you want here.

eyeseast avatar Sep 26 '22 22:09 eyeseast

some good concrete use-cases in mind

I actually found myself wanting something like this the past couple days. The use-case was databases with slightly different schema but same table names.

here is a full script:

import argparse
from pathlib import Path

from sqlite_utils import Database


def connect(args, conn=None, **kwargs) -> Database:
    db = Database(conn or args.database, **kwargs)
    with db.conn:
        db.conn.execute("PRAGMA main.cache_size = 8000")
    return db


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("database")
    parser.add_argument("dbs_folder")
    parser.add_argument("--db", "-db", help=argparse.SUPPRESS)
    parser.add_argument("--verbose", "-v", action="count", default=0)
    args = parser.parse_args()

    if args.db:
        args.database = args.db
    Path(args.database).touch()
    args.db = connect(args)

    return args


def merge_db(args, source_db):
    source_db = str(Path(source_db).resolve())

    s_db = connect(argparse.Namespace(database=source_db, verbose = args.verbose))
    for table in s_db.table_names():
        data = s_db[table].rows
        args.db[table].insert_all(data, alter=True, replace=True)

    args.db.conn.commit()


def merge_directory():
    args = parse_args()
    source_dbs = list(Path(args.dbs_folder).glob('*.db'))
    for s_db in source_dbs:
        merge_db(args, s_db)


if __name__ == '__main__':
    merge_directory()

edit: I've made some improvements to this and put it on PyPI:

$ pip install xklb
$ lb merge-db -h
usage: library merge-dbs DEST_DB SOURCE_DB ... [--only-target-columns] [--only-new-rows] [--upsert] [--pk PK ...] [--table TABLE ...]

    Merge-DBs will insert new rows from source dbs to target db, table by table. If primary key(s) are provided,
    and there is an existing row with the same PK, the default action is to delete the existing row and insert the new row
    replacing all existing fields.

    Upsert mode will update matching PK rows such that if a source row has a NULL field and
    the destination row has a value then the value will be preserved instead of changed to the source row's NULL value.

    Ignore mode (--only-new-rows) will insert only rows which don't already exist in the destination db

    Test first by using temp databases as the destination db.
    Try out different modes / flags until you are satisfied with the behavior of the program

        library merge-dbs --pk path (mktemp --suffix .db) tv.db movies.db

    Merge database data and tables

        library merge-dbs --upsert --pk path video.db tv.db movies.db
        library merge-dbs --only-target-columns --only-new-rows --table media,playlists --pk path audio-fts.db audio.db

        library merge-dbs --pk id --only-tables subreddits reddit/81_New_Music.db audio.db
        library merge-dbs --only-new-rows --pk subreddit,path --only-tables reddit_posts reddit/81_New_Music.db audio.db -v

positional arguments:
  database
  source_dbs

Also if you want to dedupe a table based on a "business key" which isn't explicitly your primary key(s) you can run this:

$ lb dedupe-db -h
usage: library dedupe-dbs DATABASE TABLE --bk BUSINESS_KEYS [--pk PRIMARY_KEYS] [--only-columns COLUMNS]

    Dedupe your database (not to be confused with the dedupe subcommand)

    It should not need to be said but *backup* your database before trying this tool!

    Dedupe-DB will help remove duplicate rows based on non-primary-key business keys

        library dedupe-db ./video.db media --bk path

    If --primary-keys is not provided table metadata primary keys will be used
    If --only-columns is not provided all non-primary and non-business key columns will be upserted

positional arguments:
  database
  table

options:
  -h, --help            show this help message and exit
  --skip-0
  --only-columns ONLY_COLUMNS
                        Comma separated column names to upsert
  --primary-keys PRIMARY_KEYS, --pk PRIMARY_KEYS
                        Comma separated primary keys
  --business-keys BUSINESS_KEYS, --bk BUSINESS_KEYS
                        Comma separated business keys

chapmanjacobd avatar Oct 01 '22 03:10 chapmanjacobd