galaxy Hard removal of items in the database (scalability discussion)

The following is a subset of items identified during the scalability discussion at the team meeting in Montpellier (May 16-20, 2022):

[ ] Identify database objects (tables, columns) that consume the most resources.
[ ] Determine feasibility of deleting such table rows or individual columns to free up space. Can we routinely delete sufficiently old data? Can we automate this?
- job_metric_* tables?
- job_state_history? (trim? remove if not used?)
- sessions (galaxy_session) can be deleted and should be
- etc...
[ ] For deleting rows, consider limited usage of cascading deletes of associated records (not applicable to galaxy_user table)
[ ] Identify regeneratable data (HDA metadata, HDA peek, ...)
[ ] Add history_dataset_association_history to cleanup scripts (purge after 10 days? a month?) (ref: #13307, #13308)
[ ] Move metadata column from HDA table into an associated table. Only copy metadata on write.
[ ] Metadata: serialize to disk for long term storage; use database as cache (consider LRU replacement policy)
[ ] For purged datasets, remove metadata column and stderr/stdout
[x] Drop the update_time column from job_state_history table #13997
[ ] Workflow invocation view only grows, there is no way to remove it even just from the UX (?)
[ ] Consider separating data needed to run galaxy from historical data (possibly needed as training data, or data for usage stats, etc.); Can we limit such historical data to the most recent n years?

Jun 01 '22 17:06 jdavcs

database objects (tables, columns) that consume the most resources

I made this graph a long time ago to answer the same question, hope it helps! https://stats.galaxyproject.eu/d/000000019/galaxy-database?orgId=1&refresh=5m&viewPanel=40

Jun 02 '22 08:06 hexylena

I made this graph a long time ago to answer the same question, hope it helps!

Oh yes, absolutely, thanks! I have a spreadsheet with reasonably precise estimates of number of rows per table on main, but that one is from 2020, so I'll update it and then post it here. (while size per table is a more useful metric, a curious detail, at least from 2 years ago, was that 42 tables had zero rows).

Jun 02 '22 14:06 jdavcs

Really easy to get from the following query if you just need fast estimates:

SELECT reltuples, relname FROM pg_class where relname not like 'pg_%' and relname not like 'ix_%' and relname not like '%_pkey' order by reltuples desc;

   reltuples   |                             relname
---------------+-----------------------------------------------------------------
  5.796195e+08 | job_metric_numeric
 4.5395987e+08 | job_parameter
 3.0803376e+08 | job_state_history
 1.6113411e+08 | history_dataset_association_history
  1.511119e+08 | dataset_permissions
 1.0779887e+08 | history_dataset_association
  9.121957e+07 | job_to_input_dataset
  8.872237e+07 | dataset
  8.311228e+07 | dataset_collection_element
 7.9665984e+07 | job_to_output_dataset
 4.7278964e+07 | job
 4.2060216e+07 | job_to_output_dataset_collection
 4.1388948e+07 | job_metric_text
 3.2805136e+07 | cleanup_event_dataset_association
 3.2322692e+07 | implicit_collection_jobs_job_association
 2.5986542e+07 | galaxy_session
 2.0864564e+07 | cleanup_event_hda_association

~I'll add it to gxadmin since that's a useful query to have.~ Edit it's there. gxadmin query pg-rows-per-table

$ gxadmin query pg-rows-per-table | head
 table_schema |                           table_name                           |     rows
--------------+----------------------------------------------------------------+---------------
 galaxy       | job_metric_numeric                                             |  5.796195e+08
 galaxy       | job_parameter                                                  | 4.5395987e+08
 galaxy       | job_state_history                                              | 3.0803376e+08
 galaxy       | history_dataset_association_history                            | 1.6113411e+08
 galaxy       | dataset_permissions                                            |  1.511119e+08
 galaxy       | history_dataset_association                                    | 1.0779887e+08
 galaxy       | job_to_input_dataset                                           |  9.121957e+07
 galaxy       | dataset                                                        |  8.872237e+07

Edit 2: 50 empty tables

$ gxadmin query pg-rows-per-table | grep ' 0' | wc -l
50

Jun 02 '22 14:06 hexylena

Really easy to get from the following query if you just need fast estimates:

Thank you! Very easy indeed. So we have 110(!) empty tables (which includes a few that are not in the model, but still..)

Jun 02 '22 14:06 jdavcs

And for the record, here's the query I was using in Montpellier to get some of the space usage numbers:

galaxy_main=> SELECT *, pg_size_pretty(total_bytes) AS total
    , pg_size_pretty(index_bytes) AS index
    , pg_size_pretty(toast_bytes) AS toast
    , pg_size_pretty(table_bytes) AS table
  FROM (
  SELECT *, total_bytes-index_bytes-coalesce(toast_bytes,0) AS table_bytes FROM (
      SELECT c.oid,nspname AS table_schema, relname AS table_name
              , c.reltuples AS row_estimate
              , pg_total_relation_size(c.oid) AS total_bytes
              , pg_indexes_size(c.oid) AS index_bytes
              , pg_total_relation_size(reltoastrelid) AS toast_bytes
          FROM pg_class c
          LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE relkind = 'r'
  ) a
) a order by a.total_bytes desc;

Just total bytes alone would be useful as a gxadmin query to use as a Grafana data source.

Jun 02 '22 15:06 natefoo

UPDATED 4/29/25: estimates of rows per table on main:

galaxy_main=> SELECT reltuples as row_estimate , relname FROM pg_class where relname not like 'pg_%' and relname not like 'ix_%' and relname not like '%_pkey' order by reltuples desc;
 row_estimate  |                             relname                             
---------------+-----------------------------------------------------------------
 7.3432563e+08 | job_metric_numeric
 5.3123466e+08 | job_parameter
 2.4797253e+08 | job_state_history
  1.736919e+08 | galaxy_session_session_key_key
 1.7361538e+08 | galaxy_session_session_key_key_hash
 1.7158123e+08 | galaxy_session
 1.5102541e+08 | history_dataset_association
 1.2499661e+08 | uq_uuid_column
 1.2499661e+08 | dataset
 1.2348484e+08 | job_to_output_dataset
 1.1306739e+08 | job_to_input_dataset
 1.0776312e+08 | dataset_permissions
  7.100435e+07 | cleanup_event_dataset_association
  7.097793e+07 | dataset_collection_element
 6.7074128e+07 | job_create_time_idx
  6.548636e+07 | job
  5.979472e+07 | history_dataset_association_history
 5.5010304e+07 | job_metric_text
 3.8546484e+07 | cleanup_event_hda_association
 2.7909732e+07 | job_external_output_metadata
 2.6947266e+07 | job_to_output_dataset_collection
...

Apr 01 '25 13:04 jdavcs

Nate's query, as of 4/29/25, everything over 10G, ordered by total size:

galaxy_main=> SELECT 
    table_name
    , row_estimate
    , pg_size_pretty(total_bytes) AS TOTAL
    , pg_size_pretty(index_bytes) AS index
    , pg_size_pretty(toast_bytes) AS toast
    , pg_size_pretty(table_bytes) AS table
  FROM (
  SELECT *, total_bytes-index_bytes-coalesce(toast_bytes,0) AS table_bytes FROM (
      SELECT relname AS table_name
              , c.reltuples AS row_estimate
              , pg_total_relation_size(c.oid) AS total_bytes
              , pg_indexes_size(c.oid) AS index_bytes
              , pg_total_relation_size(reltoastrelid) AS toast_bytes
          FROM pg_class c
          LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE relkind = 'r' 
  ) a 
) a order by a.total_bytes desc;
                           table_name                           | row_estimate  |   total    |   index    |   toast    |   table    
----------------------------------------------------------------+---------------+------------+------------+------------+------------
 history_dataset_association                                    | 1.5102541e+08 | 183 GB     | 24 GB      | 67 GB      | 92 GB
 job                                                            |  6.548636e+07 | 103 GB     | 17 GB      | 19 GB      | 66 GB
 job_metric_numeric                                             | 7.3432563e+08 | 72 GB      | 22 GB      | 8192 bytes | 50 GB
 job_parameter                                                  | 5.3123466e+08 | 72 GB      | 16 GB      | 4202 MB    | 51 GB
 galaxy_session                                                 | 1.7158123e+08 | 51 GB      | 26 GB      | 568 kB     | 25 GB
 history_dataset_association_history                            |  5.979472e+07 | 44 GB      | 3348 MB    | 27 GB      | 14 GB
 dataset                                                        | 1.2499661e+08 | 39 GB      | 24 GB      | 8192 bytes | 16 GB
 job_state_history                                              | 2.4797253e+08 | 23 GB      | 10157 MB   |            | 13 GB
 job_external_output_metadata                                   | 2.7909732e+07 | 17 GB      | 1906 MB    | 8192 bytes | 15 GB
 job_to_output_dataset                                          | 1.2348484e+08 | 15 GB      | 7292 MB    |            | 7658 MB
 dataset_permissions                                            | 1.0776312e+08 | 14 GB      | 5564 MB    | 8192 bytes | 9141 MB
 job_to_input_dataset                                           | 1.1306739e+08 | 13 GB      | 6005 MB    |            | 6841 MB
 job_metric_text                                                | 5.5010304e+07 | 11 GB      | 2468 MB    | 8192 bytes | 9240 MB

Apr 29 '25 19:04 jdavcs