Indexing is too slow
The recent update adding full attribute indexing (#241) has slowed indexing so much that experiments with more than 1700 files take longer than 30 minutes to index, and the process is killed when run on a login node:
https://github.com/COSIMA/master_index/issues/16
Unfortunately indexing with a dask client doesn't work (#239).
Some possible solutions:
- Set a maximum number of files to index. It would be pretty straightforward to loop over batches of files at this point and commit after each batch. This wouldn't stop the process being killed, but it would mean that the automatic indexing gradually adds more files to the DB each time it is run. It is more fault tolerant, and I'm in favour of doing this anyway for that reason.
- Use `multiprocessing` to parallelise the batch processing above. The error with the `dask` client occurs when trying to use SQL objects created in different threads. This wouldn't be an issue if the threading/parallelisation was at this higher level, as all data structures would be local to a thread and each thread would commit to the DB. sqlite is not recommended for high levels of concurrency, but on the face of it this doesn't seem too onerous. Splitting the indexing into separate processes will speed up the indexing significantly, potentially overcoming the 30 minute hard limit, and distributing load between separate processes has the potential to avoid other resource limits.
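To make the batching idea concrete, here is a minimal sketch using only the standard-library `sqlite3` module. It is not the cookbook's indexing code: the `ncfiles` table, the `index_in_batches` function and its `batch_size`/`max_files` parameters are all hypothetical, and the per-file work is reduced to inserting the path.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_in_batches(conn, paths, batch_size=100, max_files=None):
    """Index not-yet-indexed paths, committing after every batch.

    Returns the number of files added in this run. The `ncfiles` table
    is a hypothetical stand-in for the real schema.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS ncfiles (path TEXT PRIMARY KEY)")
    already = {row[0] for row in conn.execute("SELECT path FROM ncfiles")}
    todo = [p for p in paths if p not in already]
    if max_files is not None:
        todo = todo[:max_files]  # cap the work done in a single run

    added = 0
    for batch in batched(todo, batch_size):
        # Stand-in for the real per-file indexing work.
        conn.executemany(
            "INSERT INTO ncfiles (path) VALUES (?)", [(p,) for p in batch]
        )
        conn.commit()  # work up to here survives if the process is killed
        added += len(batch)
    return added
```

Because each run skips paths that are already in the table, repeated runs with a `max_files` cap work through the backlog a chunk at a time, and a run that is killed loses at most one uncommitted batch.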
@angus-g I profiled creating a database from a single experiment with approximately 100 files. The crude timing was:
66.21user 16.44system 3:20.90elapsed 41%CPU (0avgtext+0avgdata 333892maxresident)k
The cProfile data file is available here:
https://www.dropbox.com/s/kfpg527sokbulgf/profile.dat?dl=0
I'm using snakeviz to examine the profiling data, and it seems it spends 151s (50% of the time) in _validate_ncattribute
https://github.com/COSIMA/cosima-cookbook/blob/ac2d674bf7639558e125d390457fcf6692b92b7a/cosima_cookbook/database.py#L240-L269
and through two paths spends 150s in _setup_ncattribute
https://github.com/COSIMA/cosima-cookbook/blob/ac2d674bf7639558e125d390457fcf6692b92b7a/cosima_cookbook/database.py#L170-L195
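For anyone who would rather inspect the profile without snakeviz, the standard-library `pstats` module can read the same kind of cProfile dump. This is a small sketch; the helper names `top_cumulative` and `callers_of` are made up for illustration, and the file path in the usage note assumes the dump has been downloaded as `profile.dat`.

```python
import io
import pstats


def top_cumulative(profile_path, pattern, limit=10):
    """Return the pstats report for functions whose name matches the
    regex `pattern`, sorted by cumulative time."""
    out = io.StringIO()
    stats = pstats.Stats(profile_path, stream=out)
    stats.sort_stats("cumulative").print_stats(pattern, limit)
    return out.getvalue()


def callers_of(profile_path, pattern):
    """Return the pstats report listing which functions call the
    functions matching `pattern` (useful for seeing separate call paths)."""
    out = io.StringIO()
    pstats.Stats(profile_path, stream=out).print_callers(pattern)
    return out.getvalue()
```

For example, `print(top_cumulative("profile.dat", "ncattribute"))` lists the matching functions by cumulative time, and `print(callers_of("profile.dat", "_setup_ncattribute"))` shows the paths through which that function is reached.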
When I changed back the ordering of the cache lookup here
https://github.com/COSIMA/cosima-cookbook/pull/283/files#r833895505
the same job took 30s, and the index_file stuff that took 150s now takes 10s.
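As a rough illustration of why the lookup order could matter this much, the toy below counts SELECTs for a cache-first versus a database-first lookup. It does not reproduce `_setup_ncattribute`: the `AttributeStore` class, the `attrs` table and the `cache_first` flag are all hypothetical, and it says nothing about the ordering issue the change was meant to fix.

```python
import sqlite3


class AttributeStore:
    """Toy get-or-create store for (name, value) attribute rows."""

    def __init__(self, conn, cache_first):
        self.conn = conn
        self.cache_first = cache_first
        self.cache = {}    # (name, value) -> row id
        self.queries = 0   # number of SELECTs issued
        conn.execute(
            "CREATE TABLE IF NOT EXISTS attrs ("
            "id INTEGER PRIMARY KEY, name TEXT, value TEXT, "
            "UNIQUE(name, value))"
        )

    def _query(self, key):
        self.queries += 1
        row = self.conn.execute(
            "SELECT id FROM attrs WHERE name = ? AND value = ?", key
        ).fetchone()
        return row[0] if row else None

    def get_id(self, name, value):
        key = (name, value)
        # Cache-first: a hit avoids touching the database at all.
        if self.cache_first and key in self.cache:
            return self.cache[key]
        # Database-first (or cache miss): always pay for a SELECT.
        attr_id = self._query(key)
        if attr_id is None:
            cur = self.conn.execute(
                "INSERT INTO attrs (name, value) VALUES (?, ?)", key
            )
            attr_id = cur.lastrowid
        self.cache[key] = attr_id
        return attr_id
```

Looking up the same three attributes for 100 files issues 3 SELECTs with `cache_first=True` and 300 with `cache_first=False`, while returning the same ids either way. Attributes repeat heavily across files in an experiment, so the real code may see a similar multiplication of queries.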
In https://github.com/COSIMA/cosima-cookbook/pull/283 @ScottWales said:
Change cache system in setup_ncattribute to first check the database for matches before looking at the cache to fix an ordering issue
and fair enough:
I've not explored the performance impact of the setup_ncattribute change
Looks like it is deleterious to performance. What is the nature of the "ordering issue", @ScottWales?