discogs-xml2db
Put Postgres import and sql commands into a single script
Put the following into a script, since you need to run all of these steps every time to properly import the discogs data into the database. I run importcsv.py with nohup; it takes some time, so when I find it has completed I don't want to then have to create the keys/indexes etc. and wait some more time.
The script would need to check that each step has worked okay before moving on to the next one (see the sketch after the commands below).
python3 postgresql/psql.py < postgresql/sql/CreateTables.sql
python3 postgresql/importcsv.py /csvdir/*
python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql
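A minimal sketch of such a wrapper, assuming psql.py and importcsv.py exit with a non-zero status when they fail (the paths and the /csvdir/* placeholder are simply the ones from the commands above, and import_all.sh is a hypothetical name):

#!/bin/sh
# import_all.sh - run the whole Postgres import in one go.
# set -e stops the script at the first failing step, so the key/index
# scripts never run against a broken or partial import.
set -e

python3 postgresql/psql.py < postgresql/sql/CreateTables.sql
python3 postgresql/importcsv.py /csvdir/*
python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql

It could then be started once under nohup, e.g. nohup ./import_all.sh > import.log 2>&1 &, and left to run unattended.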
I have some concerns as to how having the indexes and keys present in the database before the import affects the speed of the import.
In most databases it tends to be slower, even in bulk load, because:
- every row has to be PK-checked and FK-checked before insert, and a single failed FK check makes the entire import fail;
- having the FKs in place means the files need to be loaded in a precise order, with the files providing the FK targets first and the files referencing them last;
- after every row (or batch of rows) inserted, the indexes may have to be updated, slowing the whole process down (building them at the end is a single table scan)
By contrast, the current solution leaves a DB full of data at the end of the csv import; optimization (PK, FK, indexes) is then left to the users' discretion.
All that aside, could you try the following two scenarios and see if the outcome is acceptable?
In a script:
Scenario A
- Run CreateTables.sql
- Import csv
- Run the rest of the sql scripts
Scenario B
- Run all sql scripts
- Import csv
Run each of the scripts above with the time command, and if the results from scenario B are positive and faster than A, I'd be happy to start looking into it. A rough sketch of such a timing harness follows.
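The sketch below just reuses the commands from this repository; scenario_a.sh and scenario_b.sh are hypothetical names, and it assumes each run starts from an empty database:

#!/bin/sh
# scenario_a.sh - create tables, import csv, then keys/constraints/indexes.
set -e
time python3 postgresql/psql.py < postgresql/sql/CreateTables.sql
time python3 postgresql/importcsv.py /csvdir/*
time python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
time python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
time python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql

#!/bin/sh
# scenario_b.sh - run all the sql scripts first, then import the csv files.
set -e
for f in CreateTables CreatePrimaryKeys CreateFKConstraints CreateIndexes; do
    time python3 postgresql/psql.py < postgresql/sql/$f.sql
done
time python3 postgresql/importcsv.py /csvdir/*

Comparing the total wall-clock time of the two scripts (and, for scenario B, whether the import even completes given the FK load-order issue above) should settle the question.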
Hi, if you look at my scripts you'll see the indexes etc. are created AFTER the import. I'm just suggesting the user should be able to run one command that creates the db tables, imports the data, and then creates the indexes, keys and constraints, rather than having to run 5 different commands.
--- is then left to the users' discretion. There may be cases where users want to change this, but for most I cannot see why they would not want to add keys and indexes after the db import; otherwise the db will be almost unusable.