Support returning unmapped reads
pysam's AlignmentFile.fetch, which we use to implement the reads/search endpoint, does not return unmapped reads. Presumably clients would be interested in receiving unmapped reads, however.
@dcolligan: There is nuance here. AlignmentFile.fetch will return an unmapped reads if another segment is mapped (see section 2, item 4.1 of the SAM specification). If all segments in the template are unmapped, those reads are obtained by fetching from tid=-1.
@bioinformed thanks for the heads up. Your input is always appreciated.
What is the correct behaviour here according to the spec? What are the conditions under which we're supposed to return unmapped reads?
In the schemas repo, SearchReadsRequest.referenceId has this comment:
The reference to query. Leaving blank returns results from all
references, including unmapped reads - this could be very large.
and SearchReadsResponse.alignments has this comment:
The list of matching alignment records, sorted by position.
Unmapped reads, which have no position, are returned last.
If/when we support unmapped reads, this will introduce some complexity for our paging algorithm since all unmapped reads have POS = 0 and are not returned by the regular fetch call.
OK, so at the moment, we're not quite implementing the protocol as defined? Perhaps we should insist on having a reference, and throw a NotImplemented("Unmapped reads not currently supported") if someone makes a query without specifying a reference? That way at least we are correctly implementing the subset of the schema that we support and are explicit about not supporting the rest of it.
That sounds like a good stopgap measure. Could you file an issue?
Also, would there be any situation in which a client would want only unmapped reads? If so, the protocol does not provide for this -- we only have (as specified) mapped + unmapped (that is, all reads) and mapped for a particular reference.
Getting only unmapped reads is an important use cases (search for viral DNA associated with cancer for example).
I seem to remember this being discussed, but don't know if there is an issues for it.
Great example of why one has to implement an API for it to be real.
Danny Colligan [email protected] writes:
That sounds like a good stopgap measure. Could you file an issue?
Also, would there be any situation in which a client would want only unmapped reads? If so, the protocol does not provide for this -- we only have (as specified) mapped + unmapped (that is, all reads) and mapped for a particular reference.
— Reply to this email directly or view it on GitHub.*
@dcolligan --- done, see issue #496.
@diekhans --- this searchReads query is very complex as it is. Perhaps a separate API call to get unmapped reads would be better than trying to funnel everything through this one interface?