
Indexing is too slow

Open aidanheerdegen opened this issue 4 years ago • 1 comments

The recent update adding full attribute indexing (#241) has slowed indexing to the point that experiments with more than 1700 files take over 30 minutes to index, and the process is killed when run on a login node:

https://github.com/COSIMA/master_index/issues/16

Unfortunately indexing with a dask client doesn't work (#239).

Some possible solutions:

  1. Set a maximum number of files to index. It would be pretty straightforward at this point to loop over batches of files and commit after each batch. This wouldn't stop the process being killed, but it would mean the automatic indexing gradually adds more and more files to the DB each time it is run. It is also more fault tolerant, and I'm in favour of doing this anyway for that reason.
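A minimal sketch of option 1, assuming hypothetical `make_session` and `index_file` callables standing in for the cookbook's real SQLAlchemy session factory and per-file indexer (neither name is from the actual codebase). Committing after each batch means a killed run keeps everything indexed so far, and the next run picks up the remainder.

```python
def index_in_batches(files, make_session, index_file, batch_size=500):
    """Index `files` in batches of `batch_size`, committing after each batch."""
    indexed = 0
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        session = make_session()
        for path in batch:
            session.add(index_file(path))  # hypothetical per-file indexer
        session.commit()                   # progress up to here survives a kill
        session.close()
        indexed += len(batch)
    return indexed
```

The key design point is that each batch is a self-contained transaction, so the database is never left with a half-indexed batch.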

  2. Use multiprocessing to parallelise the batch processing above. The error with the dask client occurs when trying to use SQL objects created in different threads. This wouldn't be an issue if the parallelisation happened at this higher level, as all data structures would be local to each process, and each process would commit to the DB itself. sqlite is not recommended for high levels of concurrency, but on the face of it this doesn't seem too onerous. Splitting the indexing into separate processes would speed it up significantly, potentially getting under the 30 minute hard limit, and distributing load between separate processes also has the potential to avoid other resource limits.
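One way option 2 could avoid the dask error is to keep SQL objects out of the workers entirely: each worker returns plain dicts, and only the parent touches the database. This is a hedged sketch, not the cookbook's code; `extract_metadata` is a hypothetical stand-in for the netCDF-reading part of indexing.

```python
import multiprocessing

def extract_metadata(path):
    # Placeholder: in the real cookbook this would open the netCDF file
    # and return plain dicts of variables and attributes. No SQLAlchemy
    # objects are created here, so nothing DB-related crosses processes.
    return {"path": path, "vars": []}

def parallel_index(files, nprocs=4):
    # Workers read files in parallel; results come back as picklable dicts.
    with multiprocessing.Pool(nprocs) as pool:
        records = pool.map(extract_metadata, files)
    # Single-writer pattern: only the parent process would insert
    # `records` into sqlite here, avoiding write contention.
    return records
```

Because only one process ever writes, sqlite's weak support for concurrent writers never comes into play.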

aidanheerdegen avatar May 16 '21 23:05 aidanheerdegen

@angus-g I profiled creating a database from a single experiment with approximately 100 files, the crude timing was

66.21user 16.44system 3:20.90elapsed 41%CPU (0avgtext+0avgdata 333892maxresident)k                                                      

The cProfile data file is available here:

https://www.dropbox.com/s/kfpg527sokbulgf/profile.dat?dl=0

I'm using snakeviz to examine the profiling data, and it seems it spends 151s (50% of the time) in _validate_ncattribute

https://github.com/COSIMA/cosima-cookbook/blob/ac2d674bf7639558e125d390457fcf6692b92b7a/cosima_cookbook/database.py#L240-L269

and through two paths spends 150s in _setup_ncattribute

https://github.com/COSIMA/cosima-cookbook/blob/ac2d674bf7639558e125d390457fcf6692b92b7a/cosima_cookbook/database.py#L170-L195

When I changed back the ordering of the cache lookup here

https://github.com/COSIMA/cosima-cookbook/pull/283/files#r833895505

the same job took 30s, and the index_file step that previously took 150s now takes 10s.
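The magnitude of the difference makes sense if you consider lookup order in isolation. This illustrative sketch (not the cookbook's actual code) shows why: with many files sharing the same attributes, a cache-first lookup hits an in-memory dict almost every time, whereas a database-first order issues a query per attribute per file. `query_db` here is a hypothetical stand-in for the SQLAlchemy query.

```python
def get_cached(key, cache, query_db):
    """Return the object for `key`, consulting the in-memory cache first."""
    if key in cache:              # cache-first: O(1) dict hit, no SQL
        return cache[key]
    obj = query_db(key)           # db-first ordering would run this every call
    cache[key] = obj
    return obj
```

With thousands of repeated attribute values, swapping the order of these two checks turns one database query into thousands.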

In https://github.com/COSIMA/cosima-cookbook/pull/283, @ScottWales said:

Change cache system in setup_ncattribute to first check the database for matches before looking at the cache to fix an ordering issue

and fair enough:

I've not explored the performance impact of the setup_ncattribute change

It looks like it is deleterious to performance. What is the nature of the "ordering issue", @ScottWales?

aidanheerdegen avatar Mar 24 '22 04:03 aidanheerdegen