htslib
htslib copied to clipboard
Add mmap hfile backend
The eventual use of this PR will be to replace the reference file wrapper in the reference fetching logic from an mFILE
to an mmapped hFILE. This will eventually be used in pushed code to #589)
Is there a reason why the mmap-based-hFILE backend in https://github.com/samtools/htslib-plugins doesn't fit your needs? (Especially if you updated it to use a fixed buffer.)
@jmarshall the intention of this PR is to replace mFILE
in the reference fetching logic (the replacement of this will be in #589, I'll push this tomorrow). I don't think it's a good idea require a plugin for core functionality, however I hadn't been aware of this plugin before - I could read and incorporate it into this PR tomorrow
^ also just made an edit description that matches the purpose of this PR
That's probably my fault giving incorrect advice to Thomas - I didn't realise there was even an mmap hFILE already, although is there a reason why it's hidden away in the plugin package instead of main stream? Mentally I label that repo as "here be dragons" and avoid it, due to the association with iRODS. ;-)
Our intention here is to remove the last reason for the existence of mFILE which in turn will aid lifting the reference querying machinery out of the cram subdirectory.
That plugin code followed conversations with @lairdm (see lairdm/develop circa August 2016), who was interested in using mmap
with faidx. I think his experiments ended with “it's not actually faster” for his use case, but the mmap plugin remained an interesting example. A plugin of course because it's not core and not all platforms have mmap()
.
I am certainly in favour of exterminating mFILE. Last I looked I thought mmap wasn't actually activated (so plain hFILE_fd
would suffice), but I see that changed with PR #187 and there are useful explanations there.
Whether you access a local file via read
or mmap
is an aspect of access method rather than it being a different file, so this would be more naturally a variant of hFILE_fd
selected via a mode
letter or vopen()
mode argument. However this is a bit of a pain and maybe it is less painful overall to make your users stick “mmap:” on the front of their filenames… (though that's pretty painful in C!)
Since mmap
is already in HTSlib anyway, sure this might as well be a core hFILE capability, but
-
Please use the plugin's
hopen_mmap
implementation instead so it's not polluted byfopen()
vagaries. -
Just wrap the whole thing in
#ifdef HAVE_MMAP
/#endif
(instead of wrapping the individual function bodies), as there is already code to return a suitable errno when!HAVE_MMAP
. -
Just use
hfile_add_scheme_handler()
directly, rather thaninit_add_plugin
which is for code in separate plugins or separate source files. -
Please figure out what's really going on with mem [sic] files are declared remote so they work with a tabix index, because these really are local files.
Actually selecting this via a mode letter hint turns out to be rather tidy, means it works with hdopen()
too, and follows glibc's (and mFILE's) lead. You might consider looking at jmarshall/htslib@5ad5d8dee9aaf3e77323261854fbb0e1e120ba2e as a starting point…
Just done a little bit of an experiment with the performance of the using a mmap inside a hfile (using @jmarshall's implementation) vs using a hfile vs an operating directly on the mmapped memory on the new reference fetching logic I'm going to push to #589. Here are my results:
Normal mmap | Normal hfile | hfile with "m" mode | |
---|---|---|---|
Sequential test | U: 2.095 S: 0.226 | U: 2.168 S: 0.251 | U: 2.211 S: 0.258 |
Parallel test | U: 17.043 S: 1.691 | U: 17.405 S: 1.817 | U: 17.464 S: 1.787 |
where the sequential test is executing the command:
time samtools view ~/NA12878.alt_bwamem_GRCh38DH.20150826.CEU.exome.cram | head -1000000 > /dev/null
and the parallel test is executing the command:
time parallel --ungroup "echo Started {} && samtools view ~/NA12878.alt_bwamem_GRCh38DH.20150826.CEU.exome.cram | head -500000 > /dev/null && echo {} Finished" ::: {1..10}
S indicates the system time and U the user time, according to GNU time.
Both of these commands are being executed on my mac, which has specifications here: https://support.apple.com/kb/sp715?locale=en_GB. It might be better to do benchmarking on a machine that is more likely to run it (i.e. have more cores), but this is an indication.
It may be best to use "samtools view -c" to count records. Otherwise you're largely just benchmarking how long it takes to print up SAM records (which isn't something htslib shines with). Try BAM too as the decode times are smaller and you'll see the impact of I/O more than with CRAM.
I've now got around to doing this benchmarking now. To test the reference fetching code exclusively, I've composed a cram file which is formed from a single sample that doesn't differ from the reference, on chromosome 22:
$ cp reference.fa reference.fq
$ bwa index reference.fa
$ bwa mem reference.fa reference.fq > ref.sam
$ samtools view -C -T reference.fa ref.sam > ref.cram
$ echo -e "@SQ\tSN:chr22\tM5:ac37ec46683600f808cdd41eac1d55cd" > new_header
$ samtools reheader -i new_header ref.cram
Then I applied a patch on #589:
diff --git a/cram/cram_io.c b/cram/cram_io.c
index 1b175e2..9457f80 100644
--- a/cram/cram_io.c
+++ b/cram/cram_io.c
@@ -44,6 +44,7 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
* a way to return errors for when malloc fails.
*/
+#include <fcntl.h>
#include <config.h>
#include <stdio.h>
@@ -68,6 +69,7 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#include <math.h>
#include <time.h>
#include <stdint.h>
+#include <sys/mman.h>
#ifdef HAVE_LIBDEFLATE
#include <libdeflate.h>
@@ -1500,8 +1502,8 @@ char *cram_content_type2str(enum cram_content_type t) {
* Frees/unmaps a reference sequence and associated file handles.
*/
static void ref_entry_free_seq(ref_entry *e) {
- if (e->seq)
- free(e->seq);
+ // if (e->seq)
+ // free(e->seq);
e->seq = NULL;
}
@@ -2006,6 +2008,12 @@ static int cram_populate_ref(cram_fd *fd, int id, ref_entry *r) {
}
r->length = st.st_size;
+ if (0){
+ int ref_fd = open(ref_fn, O_RDONLY);
+ char* mapped_file = mmap(0, st.st_size, PROT_READ, MAP_SHARED, ref_fd, 0);
+ r->seq = mapped_file;
+ }
+ else{
r->offset = r->line_length = r->bases_per_line = 0;
fd->refs->fn = r->fn = string_dup(fd->refs->pool, ref_fn);
@@ -2018,6 +2026,7 @@ static int cram_populate_ref(cram_fd *fd, int id, ref_entry *r) {
fd->refs->fp = bgzf_open_ref(ref_fn, "r", 1);
}
+ }
free(ref_fn);
I then ran:
$ for i in `seq 1 50`; do time samtools view ref.cram &> /dev/null; done 2> unmmapped_times
<CHANGE CONDITION, RECOMPLE>
$ for i in `seq 1 50`; do time samtools view ref.cram &> /dev/null; done 2> mmapped_times
$ grep real unmmapped_times | cut -c 10-12 > unmmapped_real_times
$ grep real mmapped_times | cut -c 10-12 > mapped_real_times
Below is a summary of the times:
$ r
R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin16.7.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> unmmapped <- read.csv(file="unmmapped_real_times")
> mmapped <- read.csv(file="mmapped_real_times")
> summary(unmmapped)
X254
Min. :897.0
1st Qu.:917.0
Median :927.0
Mean :927.4
3rd Qu.:934.0
Max. :966.0
> summary(mmapped)
X643
Min. :612.0
1st Qu.:621.0
Median :641.0
Mean :680.6
3rd Qu.:710.0
Max. :918.0
>
In this situation, it looks like mmapping the reference is quicker than reading using the previous method
I'm starting to look at this again now.
I prefer @jmarshall's implementation using the 'm' field in the mode string as it's cleaner to use and doesn't require juggling of temporary buffers to prepend "mmap:" to filenames either. It's also closer to the way the mFILE code works that we'll be replacing here.
It's questionable whether we should permit mmap on all of the modes, eg append isn't going to work, but that's arguably a calling code error.
Starting from @jmarshall's branch, I extended this to replace mFILE.c with hfiles instead. It needs more testing and is a work in progress, but this is the code so far:
https://github.com/jkbonfield/htslib/tree/m-mode-mmap
NB: still needing fixes, eg for Windows:
hfile.c: In function 'hdopen':
hfile.c:675:17: error: storage size of 'sbuf' isn't known
struct stat sbuf;
^~~~
hfile.c:676:21: warning: implicit declaration of function 'fstat' [-Wimplicit-function-declaration]
int fstat_ok = (fstat(fd, &sbuf) == 0);
^~~~~
hfile.c:676:9: warning: unused variable 'fstat_ok' [-Wunused-variable]
int fstat_ok = (fstat(fd, &sbuf) == 0);
^~~~~~~~
hfile.c:675:17: warning: unused variable 'sbuf' [-Wunused-variable]
struct stat sbuf;
^~~~
make: *** [Makefile:121: hfile.o] Error 1
make: *** Waiting for unfinished jobs....
Command exited with code 2
It's probably a good idea to implement a read
function which doesn't involve copying, to actually get the speed benefit of mmaping as well. I've also got replaced mFILEs in #589, so it's probably a good idea to look at getting that merged first