pp-sketchlib
Library of sketching functions used by PopPUNK
See https://eigen.tuxfamily.org/dox/classEigen_1_1LLT.html https://eigen.tuxfamily.org/dox/classEigen_1_1LDLT.html https://eigen.tuxfamily.org/dox/TopicUsingBlasLapack.html
See https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac564/6674501 Should be faster
Once a sketch group holds more than about 500k sketches, performance gets very slow. It looks like this is because the metadata cache size isn't large enough: https://forum.hdfgroup.org/t/limit-on-the-number-of-datasets-in-one-group/5892 I...
For duplication checks, it would be useful to keep a hash value of each sequence in the database, which should be easy as we read through all the sequences anyway.
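A minimal sketch of the idea, not the library's implementation: hash each sequence as it is read and group sample names by digest. The function names `sequence_digest` and `find_duplicates`, the choice of BLAKE2b, and the case/whitespace normalisation are all assumptions made for illustration.

```python
import hashlib


def sequence_digest(sequence: str) -> str:
    """Hex digest of a sequence, ignoring case and surrounding whitespace.

    Assumption: two records count as duplicates if they are identical
    after upper-casing; reverse complements are NOT treated as equal.
    """
    normalised = sequence.strip().upper().encode("ascii")
    return hashlib.blake2b(normalised, digest_size=16).hexdigest()


def find_duplicates(records):
    """Group record names by digest and keep only digests seen more than once.

    `records` is an iterable of (name, sequence) pairs (hypothetical input
    format; a real database would store the digest next to each sketch).
    """
    seen = {}
    for name, seq in records:
        seen.setdefault(sequence_digest(seq), []).append(name)
    return {digest: names for digest, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    example = [("a", "ACGT"), ("b", "acgt\n"), ("c", "ACGA")]
    print(find_duplicates(example))
```

Storing a fixed-size digest rather than the sequence keeps the extra metadata small, and the digest can be computed in the same pass that builds the sketch.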
Used in random matches, but can now use the standalone dust library. See https://github.com/mrc-ide/dust/pull/333 and https://mrc-ide.github.io/dust/articles/rng.html#reusing-the-random-random-number-generator-in-other-projects-1
Looks like there's a nice solution to memory mapping in eigen here: https://stackoverflow.com/a/51256597 _Originally posted by @johnlees in https://github.com/johnlees/pp-sketchlib/issues/53#issuecomment-773368230_
Some form of serialisation of databases, and/or JSON representation, would be useful for web interfaces
Useful for repeated queries, as otherwise they would have to be loaded from HDF5 each time
See lines 75-78 of `sketch.cu`. Just need to get a valid first hash in the read
- Sort the columns of bins (N log N), keeping track of index.
- Scan through column. Where there is a block of the same value, add one to numerator...
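The two steps can be sketched roughly as follows. Because the original note is truncated, this assumes that "numerator" means the count of matching bins for each pair of samples, and that every pair inside a block of equal values gets incremented. The function name `matching_bin_counts` and the list-of-lists input are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def matching_bin_counts(sketches):
    """Count, for every pair of samples, how many bins hold the same value.

    `sketches` is a list of equal-length lists, one per sample
    (assumed layout: sketches[sample][bin]).
    Returns a dict mapping (i, j) with i < j to the number of shared bins.
    """
    n_bins = len(sketches[0])
    numerator = defaultdict(int)
    for b in range(n_bins):
        # Sort one column of bins, keeping track of the sample index.
        column = sorted((sketch[b], idx) for idx, sketch in enumerate(sketches))
        start = 0
        # Scan the sorted column for blocks of identical values.
        for end in range(1, len(column) + 1):
            if end == len(column) or column[end][0] != column[start][0]:
                block = [idx for _, idx in column[start:end]]
                # Assumption: each pair within a block shares this bin.
                for i, j in combinations(block, 2):
                    numerator[(i, j)] += 1
                start = end
    return dict(numerator)


if __name__ == "__main__":
    print(matching_bin_counts([[1, 5, 9], [1, 6, 9], [2, 5, 9]]))
```

Sorting means only samples that actually share a value are ever paired, which avoids the all-against-all comparison when most bins differ. The inner pair loop is still quadratic in the block size, so very large blocks of identical values would need separate handling.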