ga4gh-schemas

Query only for unmapped reads

Open lbergelson opened this issue 9 years ago • 4 comments

It doesn't seem like there is any support for querying only for unmapped reads. Methods that perform alignment of difficult sites might want to request only the unmapped reads and not also get all of the mapped reads.

lbergelson, Mar 30 '15 20:03

Yes, there are definitely times that one wants only the unmapped reads.

Richard

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/274.

richarddurbin, Mar 30 '15 22:03

@kozbo I think this feature request needs to be revisited. Currently, if you want unmapped reads, you query without a reference_name and position. However, this returns all reads, which is useful for regenerating a BAM but troublesome when you want only the unmapped portion.

david4096, Jan 09 '17 18:01

There are two situations for unmapped reads in paired-end data.

  1. One end of the pair is mapped and the other is not. The unmapped end is given the mapped end's position but is flagged as unmapped.
  2. Both ends are unmapped and no position is given for either.

Both would be useful to query. The first situation would need a position range. The second would not.

jjfarrell, Jan 09 '17 19:01

Thanks @jjfarrell! It seems like we might provide an enumeration of filters for flags in the search reads interface.

enum Mapping {
  BOTH_MAPPED = 1;
  PAIR_MAPPED = 2;
  UNMAPPED = 3;
}

These are the filters in the samtools documentation.


-f INT

    Only output alignments with all bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0]. 

-F INT

    Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0]. 

 FLAGS:
0x1	PAIRED	paired-end (or multiple-segment) sequencing technology
0x2	PROPER_PAIR	each segment properly aligned according to the aligner
0x4	UNMAP	segment unmapped
0x8	MUNMAP	next segment in the template unmapped
0x10	REVERSE	SEQ is reverse complemented
0x20	MREVERSE	SEQ of the next segment in the template is reverse complemented
0x40	READ1	the first segment in the template
0x80	READ2	the last segment in the template
0x100	SECONDARY	secondary alignment
0x200	QCFAIL	not passing quality controls
0x400	DUP	PCR or optical duplicate
0x800	SUPPLEMENTARY	supplementary alignment

How to query unmapped reads (biostars)

A reads search request with the reference_id specified will return everything along the length of that reference for the read group IDs. Although it will be possible to specify nonsensical combinations of flags and range requests, we can provide sane defaults.

In the search reads request we could provide an enumeration where the default is sensitive to the most common use cases. Something like "mapped if a reference is specified in the query, unmapped if not". It's an ugly and long statement to initialize as the default, but it allows us to have a sane default that doesn't affect current query patterns.
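The default described here could be resolved on the server roughly as follows. This is only a sketch: `MappingFilter`, its members, and `resolve_mapping_filter` are hypothetical names, and the rule is simply the "mapped if a reference is specified, unmapped if not" statement taken literally.

```python
from enum import Enum
from typing import Optional


class MappingFilter(Enum):
    """Hypothetical filter values for a search reads request."""
    MAPPED = 1
    UNMAPPED = 2


def resolve_mapping_filter(
    reference_id: Optional[str],
    requested: Optional[MappingFilter] = None,
) -> MappingFilter:
    """Return the filter to apply, falling back to the proposed default."""
    if requested is not None:
        # An explicit choice from the client always wins.
        return requested
    # Default: mapped reads when a reference is given, unmapped otherwise.
    return MappingFilter.MAPPED if reference_id else MappingFilter.UNMAPPED


print(resolve_mapping_filter("chr1"))
print(resolve_mapping_filter(None))
print(resolve_mapping_filter("chr1", MappingFilter.UNMAPPED))
```

The last call corresponds to the first paired-end situation above: unmapped reads that still carry a position on a reference.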

If the search reads request were to accept an enumeration, we could recreate the samtools functionality. It seems like this query might need two lists of flags where, as in pysam, one can specify the nature of the filter. I think this would be a huge step forward for the community, removing the need to interpret bit flag codes to interact with mapping data.

An alternative might be to provide a separate /unmappedreads endpoint that assumes certain flags are set, with an option to filter by the mapping status of the read's mate.

david4096, Jan 09 '17 21:01