ga4gh-schemas
Query only for unmapped reads
It doesn't seem like there is any support for querying only for unmapped reads. Methods that perform alignment of difficult sites might want to request only the unmapped reads and not also get all of the mapped reads.
Yes, there are definitely times that one wants only the unmapped reads.
Richard
On 30 Mar 2015, at 21:17, Louis Bergelson [email protected] wrote:
It doesn't seem like there is any support for querying only for unmapped reads. Methods that perform alignment of difficult sites might want to request only the unmapped reads and not also get all of the mapped reads.
@kozbo I think this feature needs to be rebooted. Currently, if you want unmapped reads, you query without a reference_name and position. However, this returns all reads, which is useful for regenerating a BAM but troublesome when you want only the unmapped reads.
There are two situations for unmapped reads in paired-end data.
- One end of the pair is mapped and the other is not. The unmapped read is given the mapped end's position but is flagged as unmapped.
- Both ends of the pair are unmapped and no position is given for either.
Both would be useful to query. The first situation would need a position range. The second would not.
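As a rough illustration of the two situations, here is a minimal Python sketch that classifies a read from its SAM FLAG value. The bit values are the standard SAM flags; the function name and the category labels `mate_mapped` and `no_position` are hypothetical and not part of any GA4GH schema.

```python
# SAM FLAG bits, as defined in the SAM specification.
PAIRED = 0x1   # read comes from paired-end (multi-segment) sequencing
UNMAP = 0x4    # this segment is unmapped
MUNMAP = 0x8   # the next segment in the template (the mate) is unmapped


def unmapped_category(flag):
    """Classify an unmapped read; return None if the read is mapped.

    The labels are hypothetical names for the two situations above.
    """
    if not flag & UNMAP:
        return None
    if flag & PAIRED and not flag & MUNMAP:
        # The mate is mapped, so this read carries the mate's position
        # and can be found with a position range.
        return "mate_mapped"
    # The mate is also unmapped (or the read is unpaired): no position.
    return "no_position"
```

For example, FLAG 133 (paired, unmapped, second in pair) falls into the first situation and FLAG 77 (paired, unmapped, mate unmapped, first in pair) into the second.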
-- John Farrell, Ph.D. Biomedical Genetics-Evans 218 Boston University Medical School 72 East Concord Street Boston, MA
ph: 617-638-5491
Thanks @jjfarrell ! It seems like we might provide an enumeration of filters for flags in the search reads interface.
enum Mapping {
  BOTH_MAPPED = 1;
  PAIR_MAPPED = 2;
  UNMAPPED = 3;
}
These are the filters in the samtools documentation.
-f INT
Only output alignments with all bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
-F INT
Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
FLAGS:
0x1 PAIRED paired-end (or multiple-segment) sequencing technology
0x2 PROPER_PAIR each segment properly aligned according to the aligner
0x4 UNMAP segment unmapped
0x8 MUNMAP next segment in the template unmapped
0x10 REVERSE SEQ is reverse complemented
0x20 MREVERSE SEQ of the next segment in the template is reverse complemented
0x40 READ1 the first segment in the template
0x80 READ2 the last segment in the template
0x100 SECONDARY secondary alignment
0x200 QCFAIL not passing quality controls
0x400 DUP PCR or optical duplicate
0x800 SUPPLEMENTARY supplementary alignment
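To make the `-f`/`-F` semantics quoted above concrete, here is a small Python sketch of the same test applied to a single FLAG value. The function name and its parameters are hypothetical; only the bit logic follows the samtools description (all required bits set, no excluded bits set).

```python
def passes_flag_filter(flag, require=0, exclude=0):
    """Mimic `samtools view -f require -F exclude` for one FLAG value."""
    # -f: every bit in `require` must be set in the FLAG.
    has_all_required = (flag & require) == require
    # -F: no bit in `exclude` may be set in the FLAG.
    has_no_excluded = (flag & exclude) == 0
    return has_all_required and has_no_excluded
```

With this, `require=0x4, exclude=0x8` selects unmapped reads whose mate is mapped, and `require=0xC` selects reads where both ends are unmapped.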
How to query unmapped reads (biostars)
A reads search request with the reference_id specified will return everything along the length of that reference for the read group IDs. Although it will be possible to specify nonsensical combinations of flags and range requests, we can provide sane defaults.
In the search reads request we could provide an enumeration where the default is sensitive to the most common use cases. Something like "mapped if a reference is specified in the query, unmapped if not". It's an ugly and long statement to initialize as the default, but it allows us to have a sane default that doesn't affect current query patterns.
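A minimal sketch of how such a default could be resolved on the server side, in Python. Everything here is an assumption for illustration: the `MappingFilter` enum, its members, and `resolve_mapping_filter` do not exist in the schemas.

```python
from enum import Enum


class MappingFilter(Enum):
    """Hypothetical filter values for a search reads request."""
    DEFAULT = 0   # let the server decide based on the rest of the query
    MAPPED = 1
    UNMAPPED = 2
    ALL = 3


def resolve_mapping_filter(requested, reference_name):
    """Apply the rule 'mapped if a reference is specified, unmapped if not'.

    An explicit choice by the client always wins over the default.
    """
    if requested is not MappingFilter.DEFAULT:
        return requested
    if reference_name is not None:
        return MappingFilter.MAPPED
    return MappingFilter.UNMAPPED
```

A client that never sets the field gets the default behaviour, while one that wants everything, for example to regenerate a BAM, would have to ask for `ALL` explicitly under this sketch.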
If the search reads request were to accept an enumeration, we could recreate the samtools functionality. It seems like this query might need two lists of flags, where, as in pysam, one can specify the nature of the filter. I think this would be a huge step forward for the community, removing the need to interpret bit flags to interact with mapping data.
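One way the two-list idea might look, sketched in Python: named flags are combined into the two integer masks that samtools takes as `-f` and `-F`. The `ReadFlag` class and `build_masks` function are hypothetical; the member names and values are taken from the samtools flag table.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class ReadFlag(IntFlag):
    """Named SAM FLAG bits (values from the samtools flag table)."""
    PAIRED = 0x1
    PROPER_PAIR = 0x2
    UNMAP = 0x4
    MUNMAP = 0x8
    REVERSE = 0x10
    MREVERSE = 0x20
    READ1 = 0x40
    READ2 = 0x80
    SECONDARY = 0x100
    QCFAIL = 0x200
    DUP = 0x400
    SUPPLEMENTARY = 0x800


def build_masks(required=(), excluded=()):
    """Turn two lists of named flags into (require, exclude) integer masks."""
    require_mask = reduce(or_, required, ReadFlag(0))
    exclude_mask = reduce(or_, excluded, ReadFlag(0))
    return int(require_mask), int(exclude_mask)
```

A request for unmapped reads with a mapped mate would then be written as `build_masks(required=[ReadFlag.UNMAP], excluded=[ReadFlag.MUNMAP])`, with no hex values on the client side.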
An alternative might be to provide another endpoint, /unmappedreads, that assumes some state of flags being set, with an option to filter by whether the mate is mapped.