CMash For very large databases, creation of TST is slow and memory intensive

Will need to:

[x] modularize the content of MakeStreamingDNADatabase.py
[x] create methods to make the TST the old way
[x] create method to make TST without reading in the whole training database, and yielding the results to mt.Trie()

Work will happen in the branch ModularizeMake

Mar 24 '20 19:03 dkoslicki

Will concentrate first on the TST before modularizing the CountEstimators modularization

Mar 24 '20 19:03 dkoslicki

On a database with 1000 genomes: New timing: 31.374449986004038 Old timing: 36.79385429399554 Old timing (including import time): 39.00447271599842

For memory

pip install psrecord
python Make.py & psrecord $(pidof python) --interval 1 --plot <out_file>.png

New way: new Old way: old

So that's a win for the new method of TST creation.

Mar 24 '20 20:03 dkoslicki

Note to self, piping from C++ marisa trie implementation works as well: LocalTest2.py:

import sys


def main():
    sys.path.append("/home/dkoslicki/Desktop/CMash/")
    from CMash.Make import MakeTSTNew
    small_database_file = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_1000.h5"
    TST_export_file_new = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_new.tst"
    M = MakeTSTNew(small_database_file, TST_export_file_new)
    for entry in M.yield_trie_items_to_insert_no_import(small_database_file):
        print(entry)


if __name__ == "__main__":
    main()

Then

 python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst

results in same md5sum as old and new methods. Timing is 35.4 sec.

For marisa-trie C++ installation, note:

$ git clone https://github.com/s-yata/marisa-trie.git
$ cd marisa-trie
$ sudo apt-get install autoconf
$ sudo apt-get install libtool
$ autoreconf -i
$ ./configure --enable-native-code

Mar 25 '20 02:03 dkoslicki

And C++ implementation results in the following memory usage via:

python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst & sleep 1;  psrecord $(pidof marisa-build) --interval 1 --plot marisa.png

marisa

So a touch better memory usage. Will need to experiment with C++ CLI opts

Mar 25 '20 02:03 dkoslicki

Problem: MakeTSTNew is still rather memory intensive when creating large databases (~200K genomes) with many hashes (-n 5000). C++ marisa-trie seems to overcome some of this due to its ability to stream input in, but may not be able to overcome the fact that the TST needs to be completely held in memory before being constructed (as each additional key needs to have access to the whole TST to know where to be placed). May need to carefully balance how things are streamed into the TST (so the pipe doesn't need to hold a ton of data as the TST slowly constructs itself).

Alternatives: Search for a different TST option.

Note: this is mainly for very large databases with lots of sketches, so not a massive issue for applications like Metalign atm.

Mar 25 '20 03:03 dkoslicki

Modularization is complete and appears to be working, so merging to master for now as it does give a speedup.

@dkoslicki NOTE: make sure that nothing funky is happening with the file name order in the switch from pool.map to pool.imap edit: According to the docs, it should be in the same order, but I might consider a chunksize larger than 1 (maybe length of iterable/core count?)

Apr 07 '20 19:04 dkoslicki