cylc-flow database refactor

Note: Tagged against 8.0.0 as this milestone would be a good time to introduce breaking changes, however, this is not required for 8.0.0 and the interface change is less significant now that we have the GraphQL interface.

Note: We have already made breaking changes to the DB in Cylc8.

Note: We will soon need to start building functionality into the UIS to read the DB to provide offline data.

The run database can get quite large >750MB, note we have two copies of the database so that's 1.5GB. It should be pretty easy to bring this down in size.

Suggestions:

[x] Tidy the prereqs table. #3863
[x] Drop the checkpoint tables. #3906
- #3864
- These are fairly small so there is little space gain here.
- Fewer tables would make the database a bit clearer though.
[ ] Record timestamps as unix epoch integers rather than text fields.
- This should bring a significant reduction in storage requirement.
- Sqlite has date/time functions to make DB queries with integer timestamps nicer.
- We can also define our own functions for use on the Python side if we want - https://docs.python.org/3/library/sqlite3.html?highlight=sqlite3#sqlite3.Connection.create_function
[ ] Enumerate task/job statuses to store them as integers rather than text fields.
- Same pros/cons as above
- Could use a single character for clarity e.g. 'w', 'r', 's', 'f'.
[ ] Move away from compound primary keys e.g. (name, cycle_point).
- These text fields are duplicated across multiple tables which is inefficient.
- E.G. create a table of tasks (id INTEGER AUTONUMBER, name TEXT, cycle_point TEXT)
- Compound text fields could then be replaced with integers.
- Querying by task name would still work by table relations.
- E.G. the task pool should only be adding state information, everything else should be stored elsewhere.
[ ] Have a quick check to see if any fields can be removed:
- For example I'm pretty sure we don't need the time_created field in task_states.
[ ] Consider refactoring task_events and task_jobs
- These tables contain duplicate information in different formats.
- The task_events table is functionless.
- These tables account for 60-70% of memory usage.
- It would be good if this information was sufficient to reconstruct the evolution of the graph after the fact.
  - Useful for reconstructing prerequisite objects for historical tasks.
  - https://github.com/cylc/cylc-flow/pull/4581#issuecomment-1021061069
[ ] Consider using "write ahead log".

~Since the DB functionality is pretty self-contained and we don't require upgraders at this point these changes shouldn't be too much work.~ As we are now approaching 8.0.rc1 release we will need to provide full DB back-compat support for any changes.

Oct 15 '20 10:10 oliver-sanders

Most or all of the above are desirable, but maybe not on the critical path to Cylc 8 - except that, as you note, now is a good time to make breaking changes. I'd still rather get the 100% critical bits in the bag first though.

Oct 15 '20 10:10 hjoliver

+1 to Cylc 8 or Cylc 9.

For other databases, like Postgres or Oracle, it would be simpler if the Cylc database supported multiple workflows, instead of one database per workflow.

That would require modifying the database and adding more tables, more relationships. But without doing this, it would be quite hard to implement #3360.

Oct 15 '20 11:10 kinow

Could do that by adding a workflow table then adding a workflow id (integer) field to each entry?

Would need to check the integer value limit, if numbering job submissions for multiple workflows in one table Cylc could very quickly rack up some pretty big numbers!

Oct 15 '20 11:10 oliver-sanders

Could do that by adding a workflow table then adding a workflow id (integer) field to each entry?

I think that makes sense. And probably the simplest way to implement that. Later we can add other tables/fields/indexes/etc.

Would need to check the integer value limit, if numbering job submissions for multiple workflows in one table Cylc could very quickly rack up some pretty big numbers!

:+1: good idea.

Oct 15 '20 12:10 kinow

#864

Oct 29 '20 11:10 oliver-sanders

If we address this one further down the line it might be worth considering moving to a graph DB such as Dgraph. Would simplify UIS offline data and allow DB structure to more closely match the live data store.

Jun 03 '21 11:06 oliver-sanders

Nice idea.

Jun 03 '21 21:06 hjoliver

Record timestamps as unix epoch integers rather than text fields

We should check that we will not run into the year 2038 problem. According to the sqlite3 docs the integer datatype is capable of being 64-bit (8 bytes), so should be okay.

INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.

Aug 24 '21 12:08 MetRonnie

We should also look at finding an efficient way to perpetual logging prerequisites/outputs into this refactor - see https://github.com/cylc/cylc-flow/issues/4036#issuecomment-1125701742

[HO] note to store the state of prerequisites at trigger time (which is probably the important thing for past tasks) requires some additional information not currently available from the DB, because (for conditionally triggered tasks) that is not the same as the final state of the upstream task outputs.

May 13 '22 10:05 oliver-sanders

I think we are currently missing job messages from the DB, might be a good opportunity to get them in now that messages are visible in the GUI - https://github.com/cylc/cylc-flow/issues/2394#issuecomment-1362931707

Dec 22 '22 14:12 oliver-sanders