bedtools
bedtools copied to clipboard
Enhancement: subtractBed: Add switch to allow -a file to be read in memory and -b from disk
Most of the time that I want to use subtractBed, the b file is very big (e.g.: 2 GiB: 66 million lines) as it contains all known snp positions.
This command is then very slow (file a contains only 10000 positions) and sometimes is unable to run as there is not enough memory to load the b file in memory:
subtractBed -a sample_snps.bed -b allknownsnps.bed
This is the error I get when I run it on a machine with 32GiB of RAM:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
When running the command with strace:
strace -o subtractBed.strace subtractBed -a sample_snps.bed -b allknownsnps.bed
This is the end of the log:
read(3, "17\t67979430\t67979431\tG\tA\tPASS\nch"..., 8191) = 8191
brk(0x6a65bf000) = 0x6a65bf000
read(3, "7985837\tG\tC\tMinDP\nchr17\t67985847"..., 8191) = 8191
brk(0x6a65e7000) = 0x6a65e7000
read(3, "\tA\tG\tPASS\nchr17\t67992208\t6799220"..., 8191) = 8191
read(3, "\nchr17\t67999172\t67999173\tT\tG\tPAS"..., 8191) = 8191
brk(0x6a660b000) = 0x6a65e7000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "terminate called after throwing "..., 48) = 48
write(2, "std::bad_alloc", 14) = 14
write(2, "'\n", 2) = 2
write(2, " what(): ", 11) = 11
write(2, "std::bad_alloc", 14) = 14
write(2, "\n", 1) = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid() = 26915
tgkill(26915, 26915, SIGABRT) = 0
--- SIGABRT (Aborted) @ 0 (0) ---
+++ killed by SIGABRT (core dumped) +++
# Line number lose to where the PC runs out of memory: 58469306
$ grep -n -m1 '^chr17'$'\t''67999172' allknownsnps.bed
58469306:chr17 67999172 67999173 T G PASS
# Total number of lines in the file: 66007044
$ wc -l allknownsnps.bed
66007044
It would be nice to use subtractBed like this:
subtractBed -loada -a sample_snps.bed -b allknownsnps.bed
Where -loada loads file a in memory and b from disk. So if file b is read line by line, it just needs to remove all entries from file a (that is loaded in memory) that are found in file b.
I know it is possible to mimic this behaviour with, the following, but it would be much easier if it was implemented in subtractBed directly:
intersectBed -wb -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin
This is probably technically more correct (no -wb) as this allows to still use additional options for subtractBed:
intersectBed -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin