genmap
Computing only binary mappability
I think it is often useful to have a binary mappability track when you are only interested in completely unique regions. Up until now, I have been calculating the mappability and then using another tool to floor the values, or cast them to ints, to obtain only 1s and 0s. However, it would probably be more efficient if this were possible directly in GenMap.
I would imagine this is a fairly straightforward thing to implement? And I guess it would also make GenMap run a bit faster?
The other nice side effect is that the files will be much smaller, as a BED/wig file with values of just 1's can be written much more succinctly than a file with decimal values ranging from 0 to 1.
Hi Josh,
theoretically speaking you could replace the std::vector<uint8_t> with a bitvector. While the elements of a vector can be read and written in parallel, the bits of a bitvector can't. If two threads write to bits in the same region (e.g. the same 64-bit integer), you get into trouble without locking. I don't think you will see any speedup.
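To illustrate the difference, here is a minimal sketch, not GenMap code. The helper names fill_bytes and fill_bits are hypothetical. The first function lets several threads write distinct bytes of a std::vector<uint8_t> without any synchronization. The second packs 64 positions into one word, so neighbouring positions share a word and each write has to be an atomic read-modify-write (std::atomic::fetch_or here, as one alternative to locking).

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// One byte per position: threads write distinct elements, so no locking is needed.
std::vector<std::uint8_t> fill_bytes(std::size_t n, unsigned num_threads)
{
    std::vector<std::uint8_t> v(n, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&v, n, t, num_threads] {
            // Thread t handles positions t, t + num_threads, t + 2 * num_threads, ...
            for (std::size_t i = t; i < n; i += num_threads)
                v[i] = 1;
        });
    for (auto & w : workers)
        w.join();
    return v;
}

// One bit per position: 64 positions share a word, so every write must be atomic.
std::vector<std::uint64_t> fill_bits(std::size_t n, unsigned num_threads)
{
    std::vector<std::atomic<std::uint64_t>> words((n + 63) / 64);
    for (auto & w : words)
        w.store(0, std::memory_order_relaxed);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&words, n, t, num_threads] {
            for (std::size_t i = t; i < n; i += num_threads)
                words[i / 64].fetch_or(std::uint64_t{1} << (i % 64),
                                       std::memory_order_relaxed);
        });
    for (auto & w : workers)
        w.join();

    // Copy into a plain vector once all threads have finished.
    std::vector<std::uint64_t> result(words.size());
    for (std::size_t i = 0; i < words.size(); ++i)
        result[i] = words[i].load(std::memory_order_relaxed);
    return result;
}
```

The atomic version is correct, but threads that hit the same word contend with each other, which is one reason a bitvector is unlikely to be faster here.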
If RAM is not a limitation, the easiest solution would be to create a binary vector before writing it to disk. Wig and bed files should work out of the box, but dumping it into a binary format might have to be adjusted.
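A rough sketch of that post-processing step, under stated assumptions: the vector holds one occurrence count per position, a position is "unique" when its count is exactly 1, and the output is a BED-style list of 0-based, half-open intervals. The function names to_binary and binary_to_bed are hypothetical and not part of GenMap.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Assumption: counts[i] is the number of occurrences of the k-mer starting at
// position i, so a count of exactly 1 means the position is unique.
std::vector<std::uint8_t> to_binary(std::vector<std::uint8_t> const & counts)
{
    std::vector<std::uint8_t> binary(counts.size());
    for (std::size_t i = 0; i < counts.size(); ++i)
        binary[i] = (counts[i] == 1) ? 1 : 0;
    return binary;
}

// Emit one BED line (chrom, start, end) per maximal run of 1s.
std::string binary_to_bed(std::string const & chrom,
                          std::vector<std::uint8_t> const & binary)
{
    std::ostringstream out;
    std::size_t i = 0;
    while (i < binary.size())
    {
        if (binary[i] == 0)
        {
            ++i;
            continue;
        }
        std::size_t const start = i;
        while (i < binary.size() && binary[i] == 1)
            ++i;
        out << chrom << '\t' << start << '\t' << i << '\n';
    }
    return out.str();
}
```

Because only the runs of unique positions are written, the output stays small for genomes with long unique stretches, which matches the file-size argument made earlier in the thread.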
If it is not urgent, I would rather consider including it when porting to SeqAn3.
Not urgent at all.
Interesting point about bitvectors not being able to be read/written in parallel. I didn't know that! Is that a seqan3 limitation or a hardware thing?
The C++ standard says the following about containers (§ 23.2.2):
Notwithstanding (17.6.5.9), implementations are required to avoid data races when the contents of the contained object in different elements in the same container, excepting vector<bool>, are modified concurrently.
Most implementations of std::vector