
Computing only binary mappability

Open joshuak94 opened this issue 5 years ago • 4 comments

I think it is oftentimes useful to have a binary mappability of data, when you are only interested in completely unique regions. Up until now, I have been just calculating the mappability and then using another tool to floor the values, or cast them to ints, to obtain only 1's and 0's. However, it would probably be more efficient if this were possible inherently in GenMap.

I would imagine this is a fairly straightforward thing to implement? And I guess it would also make GenMap run a bit faster?

joshuak94 avatar Jul 22 '19 11:07 joshuak94

The other nice side effect is that the files will be much smaller, as a BED/wig file with values of just 1's can be written much more succinctly than a file with decimal values ranging from 0 to 1.

joshuak94 avatar Jul 22 '19 11:07 joshuak94

Hi Josh,

Theoretically speaking, you could replace the std::vector<uint8_t> with a bitvector. But while distinct elements of a vector can be read and written in parallel, the bits of a bitvector can't. If two threads write to bits stored in the same machine word (e.g. a 64-bit integer), you get into trouble without locking. So I don't think you will see any speedup.

If RAM is not a limitation, the easiest solution would be to create a binary vector before writing it to disk. Wig and bed files should work out of the box, but the code that dumps it into a binary format might have to be adjusted.
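A minimal sketch of that post-processing step, not GenMap's actual code. It assumes the std::vector<uint8_t> holds one k-mer frequency per position (so a value of 1 means the k-mer is unique). The names to_binary_mappability, unique_intervals and Interval are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: frequencies[i] is the number of occurrences of the k-mer
// starting at position i, so mappability would be 1 / frequencies[i].
// Returns a vector with 1 for unique positions and 0 for everything else.
std::vector<std::uint8_t> to_binary_mappability(std::vector<std::uint8_t> const & frequencies)
{
    std::vector<std::uint8_t> binary(frequencies.size());
    for (std::size_t i = 0; i < frequencies.size(); ++i)
        binary[i] = (frequencies[i] == 1) ? 1 : 0;
    return binary;
}

// Half-open interval [begin, end), matching the 0-based coordinates of BED.
struct Interval
{
    std::size_t begin;
    std::size_t end;
};

// Collapses consecutive runs of 1s into intervals, so a file only needs
// one line per unique region instead of one value per position.
std::vector<Interval> unique_intervals(std::vector<std::uint8_t> const & binary)
{
    std::vector<Interval> intervals;
    std::size_t i = 0;
    while (i < binary.size())
    {
        if (binary[i] == 0)
        {
            ++i;
            continue;
        }
        std::size_t const start = i;
        while (i < binary.size() && binary[i] != 0)
            ++i;
        intervals.push_back({start, i});
    }
    return intervals;
}
```

Each interval could then be written as one BED line (sequence name, begin, end), which is where the smaller files would come from.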

If it is not urgent, I would rather consider including it when porting to SeqAn3.

cpockrandt avatar Jul 30 '19 15:07 cpockrandt

Not urgent at all.

Interesting point about bitvectors not being able to be read/written in parallel. I didn't know that! Is that a seqan3 limitation or a hardware thing?

joshuak94 avatar Jul 31 '19 07:07 joshuak94

The C++ standard says the following about containers (§ 23.2.2):

Notwithstanding (17.6.5.9), implementations are required to avoid data races when the contents of the contained object in different elements in the same container, excepting vector<bool>, are modified concurrently.

Most implementations of std::vector<bool> will store the bit-vector in an array of integers. If you try to set/unset bits concurrently that are stored in the same integer value, you might run into problems.
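For illustration, one way to make concurrent bit updates safe is to pack the bits into atomic 64-bit words and use an atomic read-modify-write. This is only a sketch under my own assumptions, not something GenMap or SeqAn does. The class name AtomicBitVector is hypothetical, and the atomic operations have a cost of their own, so it is not meant as a speedup.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit vector whose bits can be set from several threads.
// Bits are packed 64 per word; each word is a std::atomic so that two
// threads touching different bits of the same word do not race.
class AtomicBitVector
{
public:
    explicit AtomicBitVector(std::size_t n) : size_(n), words_((n + 63) / 64)
    {
        // Before C++20 a default-constructed atomic is uninitialized.
        for (auto & word : words_)
            word.store(0, std::memory_order_relaxed);
    }

    // Atomically sets bit `pos`; safe to call concurrently.
    void set(std::size_t pos)
    {
        assert(pos < size_);
        words_[pos / 64].fetch_or(std::uint64_t{1} << (pos % 64), std::memory_order_relaxed);
    }

    // Reads bit `pos`. With relaxed ordering, bits set by other threads are
    // only guaranteed to be visible after synchronizing with them (e.g. join).
    bool test(std::size_t pos) const
    {
        assert(pos < size_);
        return (words_[pos / 64].load(std::memory_order_relaxed) >> (pos % 64)) & 1u;
    }

    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    std::vector<std::atomic<std::uint64_t>> words_;
};
```

A plain std::vector<uint8_t> avoids all of this because every position is its own element, at the price of eight times the memory.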

cpockrandt avatar Aug 01 '19 03:08 cpockrandt