
add prebuilt index for genome.fa

Open lincoln-harris opened this issue 4 years ago • 2 comments

Can we add a prebuilt index for the human genome .gtf / .fa files so that they load much faster?

hg38.fa -> 3 Gb
hg38.gtf -> 144 Mb
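One way to picture a "prebuilt index" is a build-once, reload-later cache on disk. The sketch below is only an illustration under stated assumptions: `parse_gtf_intervals` and `load_or_build` are hypothetical helpers (not part of cerebra), the cached structure is a plain dict of sorted tuples rather than the real interval tree, and whether the actual tree object can be serialized this way would need to be checked.

```python
import os
import pickle


def parse_gtf_intervals(gtf_path):
    """Hypothetical helper: collect (start, end, attributes) per chromosome from a GTF file."""
    intervals = {}
    with open(gtf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip GTF header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # skip malformed lines
            chrom, start, end = fields[0], int(fields[3]), int(fields[4])
            intervals.setdefault(chrom, []).append((start, end, fields[8]))
    for chrom in intervals:
        intervals[chrom].sort()
    return intervals


def load_or_build(gtf_path, cache_path):
    """Reuse the on-disk cache if it is at least as new as the GTF; otherwise rebuild it."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(gtf_path)):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    intervals = parse_gtf_intervals(gtf_path)
    with open(cache_path, "wb") as fh:
        pickle.dump(intervals, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return intervals
```

The first call pays the full parsing cost and writes the cache; later calls only pay for deserialization, which is the behavior a shipped prebuilt index would aim for.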

lincoln-harris avatar Jun 08 '20 18:06 lincoln-harris

The rate-limiting step here is constructing the genome interval tree, rather than building the genome.fa index. Not sure what to do about this.

lincoln-harris avatar Jun 08 '20 18:06 lincoln-harris

Have you confirmed that building the interval tree is the main contributor to startup time? If so, it may be worth checking whether most of that time is spent in Python calls or in the low-level C code that NCLS uses for its underlying interval tree implementation. If it is the latter, optimizing the low-level code may be worthwhile. Because the interval tree only needs to be built once and can then be used from multiple threads, a custom low-level interval tree implementation that lets multiple threads build it cooperatively might speed up this process.
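A quick way to check where the time goes is to run the build step under `cProfile` from the standard library. This is a minimal sketch, not the project's code: `build_intervals` is a hypothetical stand-in for the real tree construction, and `profile_build` is an illustrative wrapper. In the report, time spent in compiled code usually appears either as built-in entries or folded into the calling Python function's own time, so the result is a starting point rather than a definitive breakdown.

```python
import cProfile
import io
import pstats
import random


def build_intervals(n, seed=0):
    """Hypothetical stand-in for the real interval tree construction."""
    rng = random.Random(seed)
    starts = [rng.randrange(0, 10**9) for _ in range(n)]
    intervals = [(s, s + rng.randrange(1, 10**4)) for s in starts]
    intervals.sort()  # the sort runs in C and is reported as a built-in method
    return intervals


def profile_build(n=100_000, top=10):
    """Profile one build and return (result, text report sorted by own time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = build_intervals(n)
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("tottime").print_stats(top)
    return result, buf.getvalue()


if __name__ == "__main__":
    _, report = profile_build()
    print(report)
```

Swapping the stand-in for the actual construction call should show whether the Python-level loop or the underlying library dominates.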

rvanheusden avatar Jun 10 '20 22:06 rvanheusden