
extend protobuf to include list of masked sites so that usher can apply consistent masking

Open AngieHinrichs opened this issue 4 years ago • 2 comments

When it's time to extend the protobuf format, we should add a list/array of masked sites, so that usher can perform the same masking when the protobuf is initially created and when samples are added to it, and the user can find out which sites are masked for a particular protobuf.

When a protobuf is initially created from Newick + VCF, usher can take an optional input file with just a list of masked positions (very easy to extract from a file such as the Problematic Sites VCF or other source), ignore all mutations at those positions, and save the positions in the protobuf. Then when new samples are added to a protobuf, usher can ignore all mutations in the new samples at those positions.
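To make the proposed behavior concrete, here is a minimal Python sketch of the idea, not usher's actual code or protobuf schema. `MaskedTree`, `Mutation`, `parse_masked_positions`, `create_tree`, and `add_sample` are all hypothetical names, and the input is assumed to be one 1-based position per line. The point is only that the masked positions travel with the stored object, so the same filter is applied at creation time and at placement time.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Mutation:
    position: int  # 1-based reference position (assumed convention)
    ref: str
    alt: str


@dataclass
class MaskedTree:
    """Hypothetical stand-in for a protobuf that also records masked sites."""
    masked_sites: Set[int] = field(default_factory=set)
    sample_mutations: Dict[str, List[Mutation]] = field(default_factory=dict)


def parse_masked_positions(lines: Iterable[str]) -> Set[int]:
    """Read one position per line, skipping blank lines and '#' comments."""
    positions = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        positions.add(int(line))
    return positions


def drop_masked(mutations: List[Mutation], masked: Set[int]) -> List[Mutation]:
    """Ignore every mutation that falls on a masked position."""
    return [m for m in mutations if m.position not in masked]


def create_tree(samples: Dict[str, List[Mutation]], masked: Set[int]) -> MaskedTree:
    """Build the tree, masking the input and saving the mask alongside it."""
    tree = MaskedTree(masked_sites=set(masked))
    for name, muts in samples.items():
        tree.sample_mutations[name] = drop_masked(muts, tree.masked_sites)
    return tree


def add_sample(tree: MaskedTree, name: str, muts: List[Mutation]) -> None:
    """Place a new sample using the mask stored in the tree, not a user-supplied one."""
    tree.sample_mutations[name] = drop_masked(muts, tree.masked_sites)


# Made-up positions, for illustration only.
masked = parse_masked_positions(["# masked sites", "100", "", "200"])
tree = create_tree({"s1": [Mutation(100, "G", "T"), Mutation(50, "C", "T")]}, masked)
add_sample(tree, "s2", [Mutation(200, "C", "T"), Mutation(75, "A", "G")])
print(sorted(tree.masked_sites))
```

Because `add_sample` reads the mask from the tree itself, a later change to the external list of problematic sites cannot make an older tree inconsistent with the samples placed on it, and `tree.masked_sites` answers the question of which sites are masked for that particular tree.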

Currently it's up to the user to make sure that the same positions are masked when creating the protobuf and when placing new samples. That's awkward for the web interface; if we update the Problematic Sites track in the Genome Browser (used by the web interface), then older protobufs may include mutations at some of the sites that are now masked in uploaded fasta/VCF. If the masking is performed by usher consistently, then that won't be a problem any more.

AngieHinrichs • Jun 23 '21 01:06

I think this makes sense. It should involve expanding matUtils mask to work with masked sites, probably in both directions: applying new masking to a given protobuf, à la what Angie describes here, and dumping the currently masked sites or removing masking from an input protobuf.

jmcbroome • Jun 23 '21 18:06

@jmcbroome and @yatisht any progress on this? I think it will be especially useful for applications to bigger/uglier genomes, where many regions are difficult to align and call.

russcd • Sep 02 '21 16:09