htslib icon indicating copy to clipboard operation
htslib copied to clipboard

Add mmap hfile backend

Open ThomasHickman opened this issue 6 years ago • 12 comments

The eventual use of this PR will be to replace the reference file wrapper in the reference fetching logic from an mFILE to an mmapped hFILE. This will eventually be used in pushed code to #589)

ThomasHickman avatar Apr 23 '18 15:04 ThomasHickman

Is there a reason why the mmap-based-hFILE backend in https://github.com/samtools/htslib-plugins doesn't fit your needs? (Especially if you updated it to use a fixed buffer.)

jmarshall avatar Apr 23 '18 16:04 jmarshall

@jmarshall the intention of this PR is to replace mFILE in the reference fetching logic (the replacement of this will be in #589, I'll push this tomorrow). I don't think it's a good idea require a plugin for core functionality, however I hadn't been aware of this plugin before - I could read and incorporate it into this PR tomorrow

ThomasHickman avatar Apr 23 '18 17:04 ThomasHickman

^ also just made an edit description that matches the purpose of this PR

ThomasHickman avatar Apr 23 '18 17:04 ThomasHickman

That's probably my fault giving incorrect advice to Thomas - I didn't realise there was even an mmap hFILE already, although is there a reason why it's hidden away in the plugin package instead of main stream? Mentally I label that repo as "here be dragons" and avoid it, due to the association with iRODS. ;-)

Our intention here is to remove the last reason for the existence of mFILE which in turn will aid lifting the reference querying machinery out of the cram subdirectory.

jkbonfield avatar Apr 24 '18 08:04 jkbonfield

That plugin code followed conversations with @lairdm (see lairdm/develop circa August 2016), who was interested in using mmap with faidx. I think his experiments ended with “it's not actually faster” for his use case, but the mmap plugin remained an interesting example. A plugin of course because it's not core and not all platforms have mmap().

I am certainly in favour of exterminating mFILE. Last I looked I thought mmap wasn't actually activated (so plain hFILE_fd would suffice), but I see that changed with PR #187 and there are useful explanations there.

Whether you access a local file via read or mmap is an aspect of access method rather than it being a different file, so this would be more naturally a variant of hFILE_fd selected via a mode letter or vopen() mode argument. However this is a bit of a pain and maybe it is less painful overall to make your users stick “mmap:” on the front of their filenames… (though that's pretty painful in C!)

Since mmap is already in HTSlib anyway, sure this might as well be a core hFILE capability, but

  • Please use the plugin's hopen_mmap implementation instead so it's not polluted by fopen() vagaries.

  • Just wrap the whole thing in #ifdef HAVE_MMAP/#endif (instead of wrapping the individual function bodies), as there is already code to return a suitable errno when !HAVE_MMAP.

  • Just use hfile_add_scheme_handler() directly, rather than init_add_plugin which is for code in separate plugins or separate source files.

  • Please figure out what's really going on with mem [sic] files are declared remote so they work with a tabix index, because these really are local files.

jmarshall avatar Apr 24 '18 14:04 jmarshall

Actually selecting this via a mode letter hint turns out to be rather tidy, means it works with hdopen() too, and follows glibc's (and mFILE's) lead. You might consider looking at jmarshall/htslib@5ad5d8dee9aaf3e77323261854fbb0e1e120ba2e as a starting point…

jmarshall avatar Apr 26 '18 08:04 jmarshall

Just done a little bit of an experiment with the performance of the using a mmap inside a hfile (using @jmarshall's implementation) vs using a hfile vs an operating directly on the mmapped memory on the new reference fetching logic I'm going to push to #589. Here are my results:

Normal mmap Normal hfile hfile with "m" mode
Sequential test U: 2.095 S: 0.226 U: 2.168 S: 0.251 U: 2.211 S: 0.258
Parallel test U: 17.043 S: 1.691 U: 17.405 S: 1.817 U: 17.464 S: 1.787

where the sequential test is executing the command:

time samtools view ~/NA12878.alt_bwamem_GRCh38DH.20150826.CEU.exome.cram | head -1000000 > /dev/null

and the parallel test is executing the command:

time parallel --ungroup "echo Started {} && samtools view ~/NA12878.alt_bwamem_GRCh38DH.20150826.CEU.exome.cram | head -500000 > /dev/null && echo {} Finished" ::: {1..10}

S indicates the system time and U the user time, according to GNU time.

Both of these commands are being executed on my mac, which has specifications here: https://support.apple.com/kb/sp715?locale=en_GB. It might be better to do benchmarking on a machine that is more likely to run it (i.e. have more cores), but this is an indication.

ThomasHickman avatar Apr 27 '18 13:04 ThomasHickman

It may be best to use "samtools view -c" to count records. Otherwise you're largely just benchmarking how long it takes to print up SAM records (which isn't something htslib shines with). Try BAM too as the decode times are smaller and you'll see the impact of I/O more than with CRAM.

jkbonfield avatar Apr 27 '18 13:04 jkbonfield

I've now got around to doing this benchmarking now. To test the reference fetching code exclusively, I've composed a cram file which is formed from a single sample that doesn't differ from the reference, on chromosome 22:

$ cp reference.fa reference.fq
$ bwa index reference.fa
$ bwa mem reference.fa reference.fq > ref.sam
$ samtools view -C -T reference.fa ref.sam > ref.cram
$ echo -e "@SQ\tSN:chr22\tM5:ac37ec46683600f808cdd41eac1d55cd" > new_header
$ samtools reheader -i new_header ref.cram

Then I applied a patch on #589:

diff --git a/cram/cram_io.c b/cram/cram_io.c
index 1b175e2..9457f80 100644
--- a/cram/cram_io.c
+++ b/cram/cram_io.c
@@ -44,6 +44,7 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  * a way to return errors for when malloc fails.
  */

+#include <fcntl.h>
 #include <config.h>

 #include <stdio.h>
@@ -68,6 +69,7 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 #include <math.h>
 #include <time.h>
 #include <stdint.h>
+#include <sys/mman.h>

 #ifdef HAVE_LIBDEFLATE
 #include <libdeflate.h>
@@ -1500,8 +1502,8 @@ char *cram_content_type2str(enum cram_content_type t) {
  * Frees/unmaps a reference sequence and associated file handles.
  */
 static void ref_entry_free_seq(ref_entry *e) {
-    if (e->seq)
-        free(e->seq);
+    // if (e->seq)
+    //     free(e->seq);

     e->seq = NULL;
 }
@@ -2006,6 +2008,12 @@ static int cram_populate_ref(cram_fd *fd, int id, ref_entry *r) {
         }
         r->length = st.st_size;

+        if (0){
+            int ref_fd = open(ref_fn, O_RDONLY);
+            char* mapped_file = mmap(0, st.st_size, PROT_READ, MAP_SHARED, ref_fd, 0);
+            r->seq = mapped_file;
+        }
+        else{
             r->offset = r->line_length = r->bases_per_line = 0;

             fd->refs->fn = r->fn = string_dup(fd->refs->pool, ref_fn);
@@ -2018,6 +2026,7 @@ static int cram_populate_ref(cram_fd *fd, int id, ref_entry *r) {

             fd->refs->fp = bgzf_open_ref(ref_fn, "r", 1);
         }
+    }

     free(ref_fn);

I then ran:

$ for i in `seq 1 50`; do time samtools view ref.cram &> /dev/null; done 2> unmmapped_times
<CHANGE CONDITION, RECOMPLE>
$ for i in `seq 1 50`; do time samtools view ref.cram &> /dev/null; done 2> mmapped_times
$ grep real unmmapped_times | cut -c 10-12 > unmmapped_real_times
$ grep real mmapped_times | cut -c 10-12 > mapped_real_times

Below is a summary of the times:

$ r

R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin16.7.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> unmmapped <- read.csv(file="unmmapped_real_times")
> mmapped <- read.csv(file="mmapped_real_times")
> summary(unmmapped)
      X254
 Min.   :897.0
 1st Qu.:917.0
 Median :927.0
 Mean   :927.4
 3rd Qu.:934.0
 Max.   :966.0
> summary(mmapped)
      X643
 Min.   :612.0
 1st Qu.:621.0
 Median :641.0
 Mean   :680.6
 3rd Qu.:710.0
 Max.   :918.0
>

In this situation, it looks like mmapping the reference is quicker than reading using the previous method

ThomasHickman avatar Jul 09 '18 09:07 ThomasHickman

I'm starting to look at this again now.

I prefer @jmarshall's implementation using the 'm' field in the mode string as it's cleaner to use and doesn't require juggling of temporary buffers to prepend "mmap:" to filenames either. It's also closer to the way the mFILE code works that we'll be replacing here.

It's questionable whether we should permit mmap on all of the modes, eg append isn't going to work, but that's arguably a calling code error.

jkbonfield avatar Jul 12 '18 09:07 jkbonfield

Starting from @jmarshall's branch, I extended this to replace mFILE.c with hfiles instead. It needs more testing and is a work in progress, but this is the code so far:

https://github.com/jkbonfield/htslib/tree/m-mode-mmap

NB: still needing fixes, eg for Windows:

hfile.c: In function 'hdopen':
hfile.c:675:17: error: storage size of 'sbuf' isn't known
     struct stat sbuf;
                 ^~~~
hfile.c:676:21: warning: implicit declaration of function 'fstat' [-Wimplicit-function-declaration]
     int fstat_ok = (fstat(fd, &sbuf) == 0);
                     ^~~~~
hfile.c:676:9: warning: unused variable 'fstat_ok' [-Wunused-variable]
     int fstat_ok = (fstat(fd, &sbuf) == 0);
         ^~~~~~~~
hfile.c:675:17: warning: unused variable 'sbuf' [-Wunused-variable]
     struct stat sbuf;
                 ^~~~
make: *** [Makefile:121: hfile.o] Error 1
make: *** Waiting for unfinished jobs....
Command exited with code 2

jkbonfield avatar Jul 12 '18 15:07 jkbonfield

It's probably a good idea to implement a read function which doesn't involve copying, to actually get the speed benefit of mmaping as well. I've also got replaced mFILEs in #589, so it's probably a good idea to look at getting that merged first

ThomasHickman avatar Jul 12 '18 15:07 ThomasHickman