minimap2
Read index in blocks rather than pairs for efficiency
This patch freads multiple pairs at a time from the index file, rather than single pairs, for efficiency.
In my tests, this reduced the index-load time for a short-read index of human_g1k_v37.fasta by 5-10 seconds out of 20-25 seconds on AWS c6a and c7g instances.
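The idea can be sketched as follows. This is a minimal illustration, not the actual minimap2 index code: the `pair_t` struct, the `BLOCK` size, and both helper functions are hypothetical stand-ins for the on-disk layout the patch touches.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-disk key/value pair; the real minimap2 index layout differs. */
typedef struct { uint64_t key, val; } pair_t;

#define BLOCK 4096  /* pairs per fread call; amortizes per-call stdio overhead */

/* Old pattern: one fread per pair. */
static size_t read_pairs_single(FILE *f, pair_t *out, size_t n) {
    size_t got = 0;
    for (size_t i = 0; i < n; ++i)
        got += fread(&out[i], sizeof(pair_t), 1, f);  /* returns items read (0 or 1) */
    return got;
}

/* Patched pattern: fread up to BLOCK pairs at a time. */
static size_t read_pairs_blocked(FILE *f, pair_t *out, size_t n) {
    size_t got = 0;
    while (got < n) {
        size_t want = n - got < BLOCK ? n - got : BLOCK;
        size_t r = fread(&out[got], sizeof(pair_t), want, f);
        if (r == 0) break;  /* EOF or read error */
        got += r;
    }
    return got;
}
```

Both variants read the same bytes; the blocked one simply makes far fewer libc calls.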
Very interesting. Because fread is buffered (unlike the read system call), in theory the block size should not matter much. Just to double-check: are you clearing the OS page cache between measurements?
You're right about the difference between read and fread, but fread still has overhead.
On some platforms that's partly due to locking (arbitrating between multiple threads potentially reading the same buffer), but even where locks aren't taken (single-threaded x86 with a recent glibc), the overhead is noticeable.
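For the locking part of that overhead, POSIX offers the `_unlocked` stdio variants: take the stream lock once with `flockfile` and use `getc_unlocked` (or glibc's `fread_unlocked`) inside the loop. A minimal sketch, not from the patch — `drain_bytes_unlocked` is a made-up helper, and it is only safe while no other thread touches the stream:

```c
#include <stdio.h>

/* Read n single bytes from f while holding the stream lock once,
   instead of paying a lock/unlock inside every stdio call.
   Safe only while no other thread touches f. */
static size_t drain_bytes_unlocked(FILE *f, size_t n) {
    size_t got = 0;
    flockfile(f);                 /* take the stdio lock once for the loop */
    for (size_t i = 0; i < n; ++i) {
        if (getc_unlocked(f) == EOF)
            break;                /* EOF or error */
        got++;
    }
    funlockfile(f);
    return got;
}
```

This only removes locking costs; the per-call function overhead remains, which is why reading in blocks still wins.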
In addition to timing on the reference genome, I also timed reading /dev/zero with a minibenchmark - the results below:
```
fread: Performing total bytes of 100000000 as 100000000 reads of 1
Runtime (fread): 2.9340

fread: Performing total bytes of 100000000 as 100000 reads of 1000
Runtime (fread): 0.0016
```
Could you try to increase the stdio buffer size with setvbuf?
Good idea - but it doesn't change the performance.
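For reference, enlarging the stdio buffer would look roughly like this — a sketch, where `fopen_buffered` and the 128 KiB size are illustrative; the one real constraint is that `setvbuf` must be called before the first read on the stream:

```c
#include <stdio.h>

/* Open path for reading with an enlarged, fully buffered stdio buffer.
   Passing NULL lets stdio allocate (and later free) the buffer itself. */
static FILE *fopen_buffered(const char *path, size_t bufsz) {
    FILE *f = fopen(path, "r");
    if (!f) return NULL;
    if (setvbuf(f, NULL, _IOFBF, bufsz) != 0) {  /* must precede first I/O */
        fclose(f);
        return NULL;
    }
    return f;
}
```

A bigger buffer reduces the number of read syscalls, but each tiny fread still pays the full user-space call overhead, which is why it doesn't help here.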
You should be able to see this with this quick benchmark: compile it and just run it (it will read /dev/zero).
```c
#include <fcntl.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <stdio.h>

#define SHORTREAD 1
#define LONGREAD 1000
#define TOTAL (100 * 1000 * 1000)

static double secs() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double) tv.tv_sec + (double) tv.tv_usec / 1e6;
}

static void bench_fread(FILE *f, int total, int read_size) {
    int count = total / read_size;
    char buffer[LONGREAD];
    printf("\nfread: Performing total bytes of %d as %d reads of %d\n",
           total, count, read_size);
    double t = secs();
    for (int i = 0; i < count; ++i) {
        fread(buffer, read_size, 1, f);
    }
    printf("Runtime (fread): %.4f\n", secs() - t);
}

int main(int argc, char *argv[]) {
    const char *filename = (argc > 1) ? argv[1] : "/dev/zero";
    FILE *f = fopen(filename, "r");
    if (!f) {
        printf("File not found\n");
        return 1;
    }
    bench_fread(f, TOTAL, SHORTREAD);
    fclose(f);
    f = fopen(filename, "r");
    bench_fread(f, TOTAL, LONGREAD);
    fclose(f);
    return 0;
}
```
I did a little test with strace on @dslarm's code with the default buffer size and a large buffer size (131072), and @dslarm is right! From strace (-c -w) it seems the time spent in the read system call is not the overhead.
Default buffer:

```
sudo strace -w -c ./a.out /dev/zero 1 0

fread: Performing total bytes of 100000000 as 100000000 reads of 1
Runtime (fread): 1.9949

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.32    0.201489           8     24416           read

sudo strace -w -c ./a.out /dev/zero 1 1

fread: Performing total bytes of 100000000 as 100000 reads of 1000
Runtime (fread): 0.0472

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.33    0.195266           8     24416           read
```
Buffer size 131072:

```
sudo strace -w -c ./a.out /dev/zero 1 2

fread: Performing total bytes of 100000000 as 100000000 reads of 1
Runtime (fread): 1.9516

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 42.53    0.013401          18       764           read

sudo strace -w -c ./a.out /dev/zero 1 3

fread: Performing total bytes of 100000000 as 100000 reads of 1000
Runtime (fread): 0.0024

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 52.44    0.012043          16       764           read
```
I also ran perf on fread with reads of 1, and __fread_chk seems to consume a lot of CPU cycles. Locks could also be contributing to the wall-clock time, but I am not aware of a profiler that can profile based on wall-clock time.
```
46.87%  a.out  libc-2.27.so  [.] __fread_chk
23.81%  a.out  libc-2.27.so  [.] _IO_file_xsgetn
 7.35%  a.out  libc-2.27.so  [.] _IO_sgetn
 7.01%  a.out  libc-2.27.so  [.] __memmove_avx_unaligned_erms
 6.42%  a.out  a.out         [.] bench_fread.constprop.2
 2.22%  a.out  libc-2.27.so  [.] @plt
 2.10%  a.out  a.out         [.] __fread_chk@plt
```
Do you need anything more from me? As this is my first PR to the project, I need a maintainer to approve.
Hi - @lh3 are you able to merge this or give me commit rights to merge it?
@dslarm the performance is certainly much better, but the index stats are not the same for a given .mmi.
master:

```
[M::mm_idx_stat::27.208*1.00] distinct minimizers: 232561365 (8.84% are singletons); average occurrences: 2.607; average spacing: 10.476; total length: 6351814707
...
Mapped/Unmapped/Short/Masked/Skipped(all matches masked): 75454/14546/0/0/0
```

vs

```
[M::mm_idx_stat::8.746*1.00] distinct minimizers: 29071172 (8.85% are singletons); average occurrences: 2.647; average spacing: 82.553; total length: 6351814707
...
Mapped/Unmapped/Short/Masked/Skipped(all matches masked): 69158/20842/0/0/0
```
@dslarm `fread` returns the number of items read, so `ints_read` should just be `bytes_read`.

```
distinct minimizers: 232561365
distinct minimizers: 29071172
```
You're doing 8x less work. With this accounted for, I don't see any performance benefit in my testing.
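The pitfall @iiSeymour points out is that `fread` returns the number of *items* read, not bytes, so a count must be scaled by the element size before being used as a byte offset. A small illustration — `read_records` and its variable names are hypothetical, chosen only to echo the review comment:

```c
#include <stdint.h>
#include <stdio.h>

/* Read up to n 8-byte records. fread's return value is the number of
   items read, so the byte count is items * element size. */
static size_t read_records(FILE *f, uint64_t *out, size_t n, size_t *bytes_read) {
    size_t items_read = fread(out, sizeof(uint64_t), n, f);
    *bytes_read = items_read * sizeof(uint64_t);
    return items_read;
}
```

Confusing the two counts makes the loader think it has consumed 8x fewer bytes than it has, which matches the 8x discrepancy in the minimizer stats above.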
Thank you @iiSeymour for the testing. And thank you @dslarm anyway for the PR. I am closing this now.