Explorer tools performance
The time it takes to load the Database Explorer seems to have increased quite markedly recently.
import cosima_cookbook as cc
session = cc.database.create_session(db='/g/data/ik11/databases/cosima_master_2022-07-01.db')
%%time
from cosima_cookbook import explore
dbx = explore.DatabaseExplorer(session=session)
dbx
CPU times: user 1min 16s, sys: 16 s, total: 1min 32s
Wall time: 4min 16s
Previous timings, from when the explorer was retooled because it was taking a minute to start up, suggest this is a recent and significant slowdown:
https://github.com/COSIMA/cosima-cookbook/pull/274#issuecomment-956851439
TL;DR it used to take 17s
I don't know if it's a red herring, but I'm still getting 17s from the database on tmpfs (disk latency seems important...), after changing:
diff --git a/cosima_cookbook/querying.py b/cosima_cookbook/querying.py
index 8ff2e47..05cf3c1 100644
--- a/cosima_cookbook/querying.py
+++ b/cosima_cookbook/querying.py
@@ -210,7 +210,7 @@ def get_variables(
q = q.outerjoin(subq, subq.c.ncvar_id == NCVar.id)
q = q.order_by(NCFile.frequency, CFVariable.name, NCFile.time_start, NCFile.ncfile)
- q = q.group_by(CFVariable, NCFile.frequency)
+ q = q.group_by(CFVariable.id, NCFile.frequency)
if experiment is not None:
q = q.group_by(subq.c.value)
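For context on why this one-line change matters: grouping by a mapped entity makes SQLAlchemy expand the GROUP BY to every column of that entity, whereas grouping by the primary key emits a single term that SQLite can satisfy far more cheaply. A minimal sketch, using an invented two-column model (not cosima-cookbook's actual CFVariable schema):

```python
# Minimal sketch of the group_by difference; this CFVariable model is
# invented for illustration and is not cosima-cookbook's real schema.
from sqlalchemy import Column, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class CFVariable(Base):
    __tablename__ = "cfvariables"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    long_name = Column(String)

# Grouping by the entity expands to every mapped column:
print(select(CFVariable.id).group_by(CFVariable))
#   ... GROUP BY cfvariables.id, cfvariables.name, cfvariables.long_name

# Grouping by the primary key emits a single GROUP BY term:
print(select(CFVariable.id).group_by(CFVariable.id))
#   ... GROUP BY cfvariables.id
```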
Before the change it was about 30s:
before:
>>> %timeit -r3 dbx = cc.explore.DatabaseExplorer(session=session)
32.3 s ± 559 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
after:
>>> %timeit -r3 dbx = cc.explore.DatabaseExplorer(session=session)
17.1 s ± 60.2 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
I think any latency beyond that is down to serving the database over Lustre, which is unavoidable...
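For anyone wanting to reproduce the tmpfs test above, something like the following should work; using /dev/shm as the tmpfs mount point is an assumption, not a gadi-specific recommendation:

```python
# Sketch of running the explorer against a tmpfs copy of the database,
# to take disk latency out of the equation. /dev/shm is assumed to be
# a RAM-backed tmpfs mount on the node.
import shutil
import cosima_cookbook as cc
from cosima_cookbook import explore

db = "/g/data/ik11/databases/cosima_master_2022-07-01.db"
tmp_db = shutil.copy(db, "/dev/shm/cosima_master.db")  # copy onto tmpfs

session = cc.database.create_session(db=tmp_db)
dbx = explore.DatabaseExplorer(session=session)  # now free of disk latency
```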
Nice detective work, and a good idea to take the disk latency out of the equation. I don't know why it used to be so much faster. It was on the OOD, so perhaps the IO was being saturated. I know that it has always been slower on the cloud infrastructure. Now I am intrigued by the option of loading the DB into memory to improve performance.
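One way to load the DB into memory would be sqlite3's backup API. A hedged sketch follows; the StaticPool/creator wiring is a standard SQLAlchemy idiom for sharing an in-memory connection, but plugging the resulting engine into cosima-cookbook is untested here and would likely need changes beyond create_session:

```python
# Hedged sketch: copy the on-disk database into an in-memory SQLite DB,
# then hand SQLAlchemy a single shared connection to it.
import sqlite3
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

disk = sqlite3.connect("/g/data/ik11/databases/cosima_master_2022-07-01.db")
mem = sqlite3.connect(":memory:", check_same_thread=False)
disk.backup(mem)   # page-by-page copy of the whole DB into RAM
disk.close()

# StaticPool keeps the one in-memory connection alive for all sessions.
engine = create_engine("sqlite://", creator=lambda: mem, poolclass=StaticPool)
```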
A couple of quick tests on gadi suggest a fair amount of variability, and still fairly long disk access times:
$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real 2m38.213s
user 0m56.782s
sys 0m21.232s
$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real 1m30.493s
user 0m50.484s
sys 0m19.030s
$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real 2m36.051s
user 0m59.151s
sys 0m21.248s
The second run was fast because the database was still in the page cache. The third was after a ~30 min break, so similar to the first.
Sounds like that code change is worth making if it halves the time.