python-mapswipe-workers
python-mapswipe-workers copied to clipboard
Adjust postgres DB schema and reduce storage needed
There is a lot potential to improve the database schema in order to reduce storage needs / disk size. This will help to reduce costs. At the moment database size mainly depends on two tables results
and tasks
. Also the indexes take up a ton of space as well. (See overview below.)
schema_name | relname | size | table_size
-------------+----------------------------+---------+--------------
public | results | 124 GB | 133230313472
public | results_pkey | 94 GB | 100788797440
public | tasks | 68 GB | 73426419712
public | results_userid | 63 GB | 67274162176
public | results_projectid | 50 GB | 53498740736
public | results_taskid | 49 GB | 52157546496
public | results_groupid | 32 GB | 33935441920
public | results_timestamp_date_idx | 31 GB | 33730682880
public | tasks_pkey | 19 GB | 20713758720
public | tasks_task_id | 11 GB | 11902173184
public | tasks_projectid | 7699 MB | 8073166848
public | tasks_groupid | 5203 MB | 5455806464
public | groups | 116 MB | 121364480
public | groups_pkey | 61 MB | 64421888
public | groups_goupid | 37 MB | 38715392
public | groups_projectid | 21 MB | 21667840
public | users | 5328 kB | 5455872
public | spatial_ref_sys | 4584 kB | 4694016
public | users_pkey | 4472 kB | 4579328
public | users_userid | 4376 kB | 4481024
public | projects | 2128 kB | 2179072
public | results_temp | 1176 kB | 1204224
public | spatial_ref_sys_pkey | 192 kB | 196608
public | projects_pkey | 72 kB | 73728
The current structure of the results table contains highly redundant information. project_id
, group_id
and user_id
, timestamp
, start_time
, end_time
are stored for all tasks of a group although they are identical for all tasks.
CREATE TABLE IF NOT EXISTS results (
project_id varchar,
group_id varchar,
user_id varchar,
task_id varchar,
"timestamp" timestamp,
start_time timestamp DEFAULT NULL,
end_time timestamp DEFAULT NULL,
result int,
PRIMARY KEY (project_id, group_id, task_id, user_id),
FOREIGN KEY (project_id) REFERENCES projects (project_id),
FOREIGN KEY (project_id, group_id) REFERENCES groups (project_id, group_id),
FOREIGN KEY (project_id, group_id, task_id) REFERENCES tasks (project_id, group_id, task_id),
FOREIGN KEY (user_id) REFERENCES users (user_id)
);
We could introduce another table called group_results
. This would also be closer to the structure we already use in Firebase. With this table we could also do a lot of things for which we don't need the actual result value. E.g. we could calculate the overall progress of a group, the number of users over time, the total number of results etc. And this table would be ~100 times smaller, assuming that a group contains about 100 tasks. This could somehow like this.
CREATE TABLE IF NOT EXISTS group_results (
result_id int,
project_id int,
group_id int,
user_id int,
"timestamp" timestamp,
session_length int
result_count int,
PRIMARY KEY (result_id),
FOREIGN KEY (project_id) REFERENCES projects (project_id),
FOREIGN KEY (project_id, group_id) REFERENCES groups (project_id, group_id),
FOREIGN KEY (project_id, group_id, task_id) REFERENCES tasks (project_id, group_id, task_id),
FOREIGN KEY (user_id) REFERENCES users (user_id)
);
CREATE TABLE IF NOT EXISTS results (
result_id int,
task_id int,
result int,
PRIMARY KEY (result_id, task_id),
FOREIGN KEY (result_id) REFERENCES group_results (result_id)
);
Finally, it might be beneficial to use table partitioning. This could improve performance for queries that need to access data only for specific projects. This is a very common task for us as we usually extract and process results on a project basis.
This link has some interesting comments on saving space
- [ ] write down how we want to query the database
Proposed steps to address this issues:
- [x] 0. Build test dataset -> dev database is test dataset
- [x] 1. adjust init_db: all tables get new integer "postgres"id, the other "firebase"id should optimally be only save once here, but should never be referenced directly
- [x] 2. check if create Project on workers is affected by new columns (project_id, prob not, but group_id)
- [x] 3. in transfer_results/save_results_to_postgres in lowest query: adjust to new schema with a join (probably)
- [ ] 4. overhaul export module to fit new data schema
- [ ] 5. write Script which transforms old schema to new schema in new db
maby test if indices on multiple columns improve performance, e.g.:
CREATE INDEX tasks_idx ON tasks using btree (project_id, group_id, ?task_id?);
-> queries should mostly involve all the id columns, right? MultiColumn indices are also usable for only one of the columns, but the leftmost column is always the best one
2end meeting notes:
- [x] leave taskid as varchar, no seperate firebase from postgres id in this case
- [x] -> this means we dont need to join on task in insert results but instead on groups
- [x] remove extra indices -> see table below, mostly multicolumn indexes (the pkeys) are used, rest is not necessary
relid | indexrelid | schemaname | relname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+-----------------+----------------------------+------------+--------------+---------------
18015 | 18021 | public | results | results_pkey | 1337110083 | 85961952477 | 84673455153
17971 | 17977 | public | groups | groups_pkey | 789246169 | 863869261 | 863869261
17963 | 17969 | public | projects | projects_pkey | 786261209 | 791482289 | 787827285
17986 | 17992 | public | tasks | tasks_pkey | 749112873 | 750198518 | 750198518
18007 | 554848 | public | users | users_userid | 732340918 | 732340918 | 732333934
18015 | 18043 | public | results | results_projectid | 114568 | 59548210026 | 58688639078
18007 | 18013 | public | users | users_pkey | 12547 | 0 | 0
18015 | 549882 | public | results | results_timestamp_date_idx | 2199 | 6276491604 | 5728927495
18015 | 18046 | public | results | results_userid | 1486 | 87994290 | 10731281
17986 | 18006 | public | tasks | tasks_projectid | 1107 | 53932047 | 53932047
16694 | 16701 | public | spatial_ref_sys | spatial_ref_sys_pkey | 654 | 654 | 654
17971 | 17985 | public | groups | groups_goupid | 405 | 247659 | 183330
18015 | 18044 | public | results | results_groupid | 12 | 12 | 12
17971 | 17984 | public | groups | groups_projectid | 12 | 12 | 12
18015 | 18045 | public | results | results_taskid | 3 | 27 | 27
17986 | 18005 | public | tasks | tasks_groupid | 0 | 0 | 0
17986 | 18004 | public | tasks | tasks_task_id | 0 | 0 | 0
(17 rows)
Tests: Setup new schemas and fill with dev data:
- [ ] dont replace ids with ints, but split results into group_results and results
- [ ] switch project and group id with int
-> look at difference of size and performance of accesses.
Updating these findings as of today, just to be sure we're chasing the right problem:
Table and index sizes
SELECT
schema_name,
relname,
pg_size_pretty(table_size) AS size,
table_size
FROM (
SELECT
pg_catalog.pg_namespace.nspname AS schema_name,
relname,
pg_relation_size(pg_catalog.pg_class.oid) AS table_size
FROM pg_catalog.pg_class
JOIN pg_catalog.pg_namespace ON relnamespace = pg_catalog.pg_namespace.oid
) t
WHERE schema_name = 'public' AND table_size > 0 ORDER BY table_size DESC;
gives
schema_name | relname | size | table_size
-------------+----------------------------+------------+--------------
public | results | 141 GB | 151915610112
public | results_pkey | 102 GB | 109276037120
public | results_userid | 78 GB | 83832397824
public | tasks | 76 GB | 81614389248
public | results_projectid | 62 GB | 66519474176
public | results_groupid | 39 GB | 42159013888
public | results_timestamp_date_idx | 39 GB | 41553936384
public | tasks_pkey | 21 GB | 22915842048
public | tasks_projectid | 8546 MB | 8960737280
public | groups | 135 MB | 141942784
public | groups_pkey | 72 MB | 75096064
public | groups_goupid | 43 MB | 45096960
public | groups_projectid | 24 MB | 25591808
public | results_temp_new | 10 MB | 10625024
public | users | 5800 kB | 5939200
public | users_pkey | 4904 kB | 5021696
public | spatial_ref_sys | 4584 kB | 4694016
public | users_userid | 4392 kB | 4497408
public | projects | 2464 kB | 2523136
public | spatial_ref_sys_pkey | 192 kB | 196608
public | projects_pkey | 80 kB | 81920
public | results_temp | 8192 bytes | 8192
(22 rows)
Put another way the results
table weighs 140GB + 320GB of indexes on it.
Checking the usage of these, we get:
SELECT psai.* FROM pg_stat_all_indexes psai LEFT JOIN pg_index i ON i.indexrelid =psai.indexrelid WHERE psai.schemaname = 'public' ORDER BY psai.idx_scan desc;
gives
relid | indexrelid | schemaname | relname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+-----------------+----------------------------+----------+--------------+---------------
17963 | 17969 | public | projects | projects_pkey | 24146006 | 67657203 | 44581343
17971 | 17977 | public | groups | groups_pkey | 24024285 | 82795251 | 78013996
18015 | 18021 | public | results | results_pkey | 13490408 | 18344 | 3684
17986 | 17992 | public | tasks | tasks_pkey | 13480209 | 13712283 | 13667019
18007 | 554848 | public | users | users_userid | 13457413 | 13457413 | 13457413
18015 | 18043 | public | results | results_projectid | 89532 | 27161546942 | 27161531590
18007 | 18013 | public | users | users_pkey | 3687 | 0 | 0
18015 | 549882 | public | results | results_timestamp_date_idx | 3600 | 35109505 | 35104591
16694 | 16701 | public | spatial_ref_sys | spatial_ref_sys_pkey | 1800 | 1800 | 1800
17986 | 18006 | public | tasks | tasks_projectid | 167 | 3947303 | 3924671
17971 | 17985 | public | groups | groups_goupid | 57 | 27253 | 27099
18015 | 18044 | public | results | results_groupid | 0 | 0 | 0
18015 | 18046 | public | results | results_userid | 0 | 0 | 0
17971 | 17984 | public | groups | groups_projectid | 0 | 0 | 0
(14 rows)
We can drop:
-
results_userid
- 78GB -
results_projectid
- 62GB -
results_groupId
- 39GB -
result_timestamp_date_idx
- 39GB TOTAL: ~220GB off