minimap2
Read index in blocks rather than pairs for efficiency
This patch freads multiple pairs at a time from the index file, rather than single pairs, for efficiency.
In my tests, this reduced the index-load time for a short-read index of human_g1k_v37.fasta by 5-10 seconds out of 20-25 seconds on AWS c6a and c7g instances.
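The idea can be sketched as follows. This is a minimal illustration, not the actual minimap2 index code: the `pair_t` struct, the `BLOCK` size, and both helper functions are hypothetical stand-ins for the on-disk layout the patch touches.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-disk key/value pair; the real minimap2 index layout differs. */
typedef struct { uint64_t key, val; } pair_t;

#define BLOCK 4096  /* pairs per fread call; amortizes per-call stdio overhead */

/* Old pattern: one fread per pair. */
static size_t read_pairs_single(FILE *f, pair_t *out, size_t n) {
    size_t got = 0;
    for (size_t i = 0; i < n; ++i)
        got += fread(&out[i], sizeof(pair_t), 1, f);  /* returns items read (0 or 1) */
    return got;
}

/* Patched pattern: fread up to BLOCK pairs at a time. */
static size_t read_pairs_blocked(FILE *f, pair_t *out, size_t n) {
    size_t got = 0;
    while (got < n) {
        size_t want = n - got < BLOCK ? n - got : BLOCK;
        size_t r = fread(&out[got], sizeof(pair_t), want, f);
        if (r == 0) break;  /* EOF or read error */
        got += r;
    }
    return got;
}
```

Both variants read the same bytes; the blocked one simply makes far fewer libc calls.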
Very interesting. Because fread is buffered (unlike the read system call), in theory the block size should not matter much. Just to double-check: are you clearing the OS page cache between measurements?
You're right about the difference between read and fread, but fread still has overhead.
On some platforms that's partly due to locking (arbitrating between multiple threads potentially reading the same buffer), but even where locks aren't taken (single-threaded x86 with a recent glibc), the overhead is noticeable.
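For the locking part of that overhead, POSIX offers the `_unlocked` stdio variants: take the stream lock once with `flockfile` and use `getc_unlocked` (or glibc's `fread_unlocked`) inside the loop. A minimal sketch, not from the patch — `drain_bytes_unlocked` is a made-up helper, and it is only safe while no other thread touches the stream:

```c
#include <stdio.h>

/* Read n single bytes from f while holding the stream lock once,
   instead of paying a lock/unlock inside every stdio call.
   Safe only while no other thread touches f. */
static size_t drain_bytes_unlocked(FILE *f, size_t n) {
    size_t got = 0;
    flockfile(f);                 /* take the stdio lock once for the loop */
    for (size_t i = 0; i < n; ++i) {
        if (getc_unlocked(f) == EOF)
            break;                /* EOF or error */
        got++;
    }
    funlockfile(f);
    return got;
}
```

This only removes locking costs; the per-call function overhead remains, which is why reading in blocks still wins.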
In addition to timing on the reference genome, I also timed reading /dev/zero with a minibenchmark - the results below:
```
fread: Performing total bytes of 100000000 as 100000000 reads of 1
Runtime (fread): 2.9340

fread: Performing total bytes of 100000000 as 100000 reads of 1000
Runtime (fread): 0.0016
```
Could you try to increase the stdio buffer size with setvbuf?
Good idea - but it doesn't change the performance.
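For reference, enlarging the stdio buffer would look roughly like this — a sketch, where `fopen_buffered` and the 128 KiB size are illustrative; the one real constraint is that `setvbuf` must be called before the first read on the stream:

```c
#include <stdio.h>

/* Open path for reading with an enlarged, fully buffered stdio buffer.
   Passing NULL lets stdio allocate (and later free) the buffer itself. */
static FILE *fopen_buffered(const char *path, size_t bufsz) {
    FILE *f = fopen(path, "r");
    if (!f) return NULL;
    if (setvbuf(f, NULL, _IOFBF, bufsz) != 0) {  /* must precede first I/O */
        fclose(f);
        return NULL;
    }
    return f;
}
```

A bigger buffer reduces the number of read syscalls, but each tiny fread still pays the full user-space call overhead, which is why it doesn't help here.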
You should be able to see this with this quick benchmark: compile it and just run it (it will read /dev/zero).
```c
#include <fcntl.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <stdio.h>

#define SHORTREAD 1
#define LONGREAD 1000
#define TOTAL (100 * 1000 * 1000)

static double secs() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double) tv.tv_sec + (double) tv.tv_usec / 1e6;
}

static void bench_fread(FILE *f, int total, int read_size) {
    int count = total / read_size;
    char buffer[LONGREAD];
    printf("\nfread: Performing total bytes of %d as %d reads of %d\n",
           total, count, read_size);
    double t = secs();
    for (int i = 0; i < count; ++i) {
        fread(buffer, read_size, 1, f);
    }
    printf("Runtime (fread): %.4f\n", secs() - t);
}

int main(int argc, char *argv[]) {
    const char *filename = (argc > 1) ? argv[1] : "/dev/zero";
    FILE *f = fopen(filename, "r");
    if (!f) {
        printf("File not found\n");
        return 1;
    }
    bench_fread(f, TOTAL, SHORTREAD);
    fclose(f);
    f = fopen(filename, "r");
    bench_fread(f, TOTAL, LONGREAD);
    fclose(f);
    return 0;
}
```
I did a little test with strace on @dslarm's code with the default buffer size and a large buffer size (131072), and @dslarm is right! From strace (-c -w) it seems the time spent in the read system call is not the overhead.
Default buffer:

```
sudo strace -w -c ./a.out /dev/zero 1 0

fread: Performing total bytes of 100000000 as 100000000 reads of 1
Runtime (fread): 1.9949

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.32    0.201489           8     24416           read

sudo strace -w -c ./a.out /dev/zero 1 1

fread: Performing total bytes of 100000000 as 100000 reads of 1000
Runtime (fread): 0.0472

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.33    0.195266           8     24416           read
```
Buffer size 131072:

```
sudo strace -w -c ./a.out /dev/zero 1 2

fread: Performing total bytes of 100000000 as 100000000 reads of 1
Runtime (fread): 1.9516

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 42.53    0.013401          18       764           read

sudo strace -w -c ./a.out /dev/zero 1 3

fread: Performing total bytes of 100000000 as 100000 reads of 1000
Runtime (fread): 0.0024

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 52.44    0.012043          16       764           read
```
I also ran perf on fread with reads of 1, and __fread_chk seems to consume a lot of CPU cycles. Locks could also be contributing to the wall-clock time, but I am not aware of a profiler that can profile based on wall-clock time.
```
46.87%  a.out  libc-2.27.so  [.] __fread_chk
23.81%  a.out  libc-2.27.so  [.] _IO_file_xsgetn
 7.35%  a.out  libc-2.27.so  [.] _IO_sgetn
 7.01%  a.out  libc-2.27.so  [.] __memmove_avx_unaligned_erms
 6.42%  a.out  a.out         [.] bench_fread.constprop.2
 2.22%  a.out  libc-2.27.so  [.] @plt
 2.10%  a.out  a.out         [.] __fread_chk@plt
```
Do you need anything more from me? As this is my first PR to the project, I need a maintainer to approve.
Hi - @lh3 are you able to merge this or give me commit rights to merge it?
@dslarm the performance is certainly much better, but the index stats are not the same for a given .mmi.
master:

```
[M::mm_idx_stat::27.208*1.00] distinct minimizers: 232561365 (8.84% are singletons); average occurrences: 2.607; average spacing: 10.476; total length: 6351814707
...
Mapped/Unmapped/Short/Masked/Skipped(all matches masked): 75454/14546/0/0/0
```

vs

```
[M::mm_idx_stat::8.746*1.00] distinct minimizers: 29071172 (8.85% are singletons); average occurrences: 2.647; average spacing: 82.553; total length: 6351814707
...
Mapped/Unmapped/Short/Masked/Skipped(all matches masked): 69158/20842/0/0/0
```
@dslarm `fread` returns the number of items read, so `ints_read` should just be `bytes_read`.

```
distinct minimizers: 232561365
distinct minimizers: 29071172
```
You're doing 8x less work. With this accounted for, I don't see any performance benefit in my testing.
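The pitfall @iiSeymour points out is that `fread` returns the number of *items* read, not bytes, so a count must be scaled by the element size before being used as a byte offset. A small illustration — `read_records` and its variable names are hypothetical, chosen only to echo the review comment:

```c
#include <stdint.h>
#include <stdio.h>

/* Read up to n 8-byte records. fread's return value is the number of
   items read, so the byte count is items * element size. */
static size_t read_records(FILE *f, uint64_t *out, size_t n, size_t *bytes_read) {
    size_t items_read = fread(out, sizeof(uint64_t), n, f);
    *bytes_read = items_read * sizeof(uint64_t);
    return items_read;
}
```

Confusing the two counts makes the loader think it has consumed 8x fewer bytes than it has, which matches the 8x discrepancy in the minimizer stats above.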
Thank you @iiSeymour for the testing. And thank you @dslarm anyway for the PR. I am closing this now.