pairtools
pairtools copied to clipboard
Identical read_len1 and read_len2 reported with different real read lengths in R1&2 pairs
I have some data with different lengths of read1 (81 nt) and read2 (221 nt) and I want to report the length of the read in the pairs. I add
--add-columns pos5,pos3,read_len,mapq \
to parse2, but in the pairs, I think depending on walk pair type, the read length is sometimes reported incorrectly. In the case of R1-2 reads both values are reported, in case R1 or R2 reads only the appropriate value is reported. I think this is correct behaviour, but perhaps should be clarified in the docs?
However, in R1&2 pairs both read lengths are reported as 81 nt, while I would expect both 81 and 221 nt reported. This is not what I would expect in this situation, is this intended behaviour?
It looks like your use case implies that read lengths for left and right alignments in the pair should have their lengths reported separately. Do you think that would be more expected behavior if there were read_len1 and read_len2?
There are read_len1 and read_len2 for each pair already, and they are reported separately - and correctly in case of R1-2 pairs. Just seems like in case of R1&2 it always reports the value from read1.
Yeah, I think it's because when the pair is present on both R1 and R2, we report all the properties of this pair on read1. It's not only about the read length, but all the properties of the alignment. Reporting all the info about the alignments originating from both read pairs is a logistic nightmare, and I do not see the reason why read_length2 shall reflect the properties of read 2 instead of read2. I would just accept this fundamental ambiguity of R1&2 pairs and don't consider reporting R1 info as incorrect.
Will be glad to hear other opinions, the line block responsible for this implementation starts here: https://github.com/open2c/pairtools/blob/9a0b894d0f65ec388a753862b072e90e53dacbd8/pairtools/lib/parse.py#L1093-L1105
If this is very painful to implement, I understand, no problem. But still I think in principle it would be better/cleaner and less confusing... Off the top of my head I can't imagine a scenario where it actually makes a practical difference, but possibly it exists.
Yes, for now we have to make choice whether to report info for only one read if the same pair is present on both. You can always report readIDs of R1&2 and make a more thorough analysis of these reads.
Maybe having an option to change between read1 or read2 reporting is a way to go?
Another thought: maybe there can be an option choice where pairtools would report stacked info for read1 and read2 (e.g., comma-separated)? It's not particularly elegant, but if you really need it, you at least can have both in the .pairs file.