HIBF creates a very large index
Hi
I have been trying to build an index of a large collection of microbial genomes (102999) using the HIBF, and the resulting index is much larger than when I create the same index using an IBF.
The raptor version I used:
VERSION
Last update: 2023-08-30
Raptor version: 3.1.0-rc.1 (raptor-v3.0.0-146-gedec71b5a2c19a2203278db814b3362ddb98e9e6)
Sharg version: 1.1.1
SeqAn version: 3.4.0-rc.1
The layout stat file:
## ### Parameters ###
## number of user bins = 102999
## number of hash functions = 2
## false positive rate = 0.05
## ### Notation ###
## X-IBF = An IBF with X number of bins.
## X-HIBF = An HIBF with tmax = X, e.g a maximum of X technical bins on each level.
## ### Column Description ###
## tmax : The maximum number of technical bin on each level
## c_tmax : The technical extra cost of querying an tmax-IBF, compared to 64-IBF
## l_tmax : The estimated query cost for an tmax-HIBF, compared to an 64-HIBF
## m_tmax : The estimated memory consumption for an tmax-HIBF, compared to an 64-HIBF
## (l*m)_tmax : Computed by l_tmax * m_tmax
## size : The expected total size of an tmax-HIBF
# tmax  c_tmax  l_tmax  m_tmax  (l*m)_tmax  size
64      1.00    0.00    1.00    0.00        424.3GiB
384     1.51    3.34    1.48    4.96        630.0GiB
# Best t_max (regarding expected query runtime): 64
The prepare, layout, and build commands I used:
raptor prepare --input genomes.lst --output genomes_k20_w20 --kmer 20 --window 20 --threads 32
raptor layout --input-file genomes_k20_w20/minimiser.list --output-sketches-to genomes_k20_w20 \
--determine-best-tmax --kmer-size 20 --false-positive-rate 0.05 --threads 32 \
--output-filename genomes_k20_w20_binning
raptor build --input genomes_k20_w20_binning --output genomes_k20_w20.index --threads 32
The final index is ~1 TiB. These are the timings of building the index, which had a peak memory usage of ~3 TiB:
============= Timings =============
Wall clock time [s]: 40397.13
Peak memory usage [TiB]: 2.9
Index allocation [s]: 0.00
User bin I/O avg per thread [s]: 0.00
User bin I/O sum [s]: 0.00
Merge kmer sets avg per thread [s]: 0.00
Merge kmer sets sum [s]: 0.00
Fill IBF avg per thread [s]: 0.00
Fill IBF sum [s]: 0.00
Store index [s]: 0.00
The IBF index is ~750 GiB and required a fraction of the memory to build. Shouldn't the HIBF be smaller than the IBF index? Any suggestions are much appreciated :-)
Thanks Antonio
Hey there!
Version
The version you are using has some major refactorings. That's also why the Timings show 0.00 seconds for most of the statistics.
The results should be the same (unit tests are fine), but I haven't benchmarked the performance yet. You could use the latest release (3.0.1), but I don't think that the results would be different.
EDIT: One bug that I just encountered, and that will be fixed soon, is that raptor build will always use the same number of threads as used for raptor layout, ignoring the raptor build --threads option.
Layout
It looks like it will use t_max = 64, so the HIBF will have at least 3 levels (log_64(102999) is about 2.8).
This may result in a bigger index size than using only 2 levels (t_max = 384).
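As a rough back-of-the-envelope check, assuming the number of levels is about ceil(log_tmax(number of user bins)):
ceil(log_64(102999))  = ceil(2.8) = 3 levels
ceil(log_384(102999)) = ceil(1.9) = 2 levels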
We will have to investigate why the estimated size (424.3GiB) is so far off the actual size.
Building RAM
The memory usage looks way too high. This might be due to t_max = 64. When building in parallel, we store the k-mers that we insert in lower levels to reuse in upper levels. With a small t_max, we have to store more content, which increases memory usage.
Index Size
Whether the HIBF is smaller than the IBF depends on the data and t_max.
The worst case for the HIBF is when all the genomes are equally sized (size = number of unique k-mers).
Let's say all genomes are equally sized, and we have 4096 genomes. Then an HIBF with t_max = 64 would have two layers. The top level has 64 bins available, and each of these bins would merge 64 of the original genomes (64*64=4096). So we would store all k-mers in the top level, and then the k-mers of 64 genomes for each of the 64 lower-level IBFs. Long story short, we would, in this worst case, have an index of twice the size of the IBF.
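To make the counting explicit (same assumptions as above: 4096 equally sized genomes, t_max = 64):
top level:   64 merged bins × 64 genomes each → all 4096 genomes' k-mers stored once
lower level: 64 child IBFs × 64 genomes each  → all 4096 genomes' k-mers stored again
HIBF total:  roughly 2 × the k-mer content of a single flat IBF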
When using 3 levels, this might get worse, depending on the data.
It looks like your data is quite unevenly sized (750GiB vs 1TiB, even though there are 3 layers).
This might also improve when using t_max = 384.
Questions/Suggestions
- Try running the layout without --determine-best-tmax. It should then default to using t_max = 384 (see the adjusted command after this list).
- Note: If you have exactly one file per genome, you can also skip raptor prepare. But since you've already run it, you can just reuse the minimiser.list for raptor layout.
- Is the list of genomes something that you can share, and are the genomes freely available? Then we could also try it ourselves.
- Can you share the layout file? You should be able to attach a gzipped file to a GitHub comment.
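For the first suggestion, reusing the paths from your run, the layout call would look something like this (same parameters as before, just without --determine-best-tmax):
raptor layout --input-file genomes_k20_w20/minimiser.list --output-sketches-to genomes_k20_w20 \
    --kmer-size 20 --false-positive-rate 0.05 --threads 32 \
    --output-filename genomes_k20_w20_binning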
Hi @eseiler,
thank you very much for your prompt answer; it is very useful. I will try your recommendations :-)
You can get the genome fasta files from here and the layout files here
An update on this: without specifying --determine-best-tmax, the index is now only 588GiB and the peak memory was 590GiB.