Stop producing empty alleles
make_prg can produce sites with i) direct deletions (e.g. REF AT, ALT "") and ii) direct insertions (e.g. REF "", ALT AT). By REF I mean the first allele in the site, which is how we embed a 'reference' in gramtools.
Though @mbhall88 has rightly pointed out that a site made by this tool does not have to translate to one in pandora/gramtools, I argue that if we fix this problem here, there's no need to deal with it there. This is especially relevant for gramtools, as by default each site produced here is a variant site in the output of genotype. It also matters because the VCF spec (https://github.com/samtools/hts-specs/blob/master/VCFv4.3.pdf section 1.6.1) states that neither REF nor ALT should be empty.
I will have a look at how to fix this.
So the issue here is to do with multiple levels of nesting, which can make it really hard to "get the previous letter and add it to both". E.g. suppose that the allele is at level 3, but the bubble it's embedded in has no string before the start of the nested split. It gets very messy with the recursion very quickly.
On the other hand, in pandora at least, it's pretty easy to fix this so that there are no empty-string alleles once the template VCF has been created from the graph and given the VCF reference.
If you can find a fix, great!
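To make the "previous letter" idea concrete for the flat case (a site that has already been written out against a linear reference, with no nesting), here is a minimal Python sketch. It is not code from make_prg, pandora or gramtools; the function name `pad_empty_alleles` and its arguments are hypothetical, and it assumes a 1-based position with the first allele acting as REF. When the site starts at the very first reference base there is no previous letter, so the sketch pads with the following base instead.

```python
from typing import List, Tuple


def pad_empty_alleles(reference: str, pos: int, alleles: List[str]) -> Tuple[int, List[str]]:
    """Return (pos, alleles) such that no allele is the empty string.

    Assumptions (hypothetical interface, for illustration only):
    - pos is 1-based, as in VCF;
    - alleles[0] is REF and covers reference[pos - 1 : pos - 1 + len(REF)];
    - an empty REF at pos means an insertion just before reference base pos.
    """
    if all(alleles):
        # Nothing to fix: every allele already has at least one base.
        return pos, list(alleles)

    if pos > 1:
        # Usual case: prepend the base just before the site to every allele
        # and shift the position one to the left.
        previous_base = reference[pos - 2]
        return pos - 1, [previous_base + allele for allele in alleles]

    # The site starts at the first reference base, so there is no previous
    # letter. Append the base that follows REF to every allele instead.
    end = pos - 1 + len(alleles[0])
    if end >= len(reference):
        raise ValueError("no flanking base available to pad with")
    next_base = reference[end]
    return pos, [allele + next_base for allele in alleles]


reference = "GCATTA"
# Direct deletion: REF "AT", ALT "" at position 3.
print(pad_empty_alleles(reference, 3, ["AT", ""]))
# Direct insertion: REF "", ALT "AT" before position 3.
print(pad_empty_alleles(reference, 3, ["", "AT"]))
```

This only works because a flat record always has a well-defined neighbouring reference base. Inside nested sites that neighbour may itself sit in another bubble, which is where the recursion gets messy.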
Thanks for pointing this out @rmcolq. Indeed, it looks like we can't easily prepend/append sequence to non-match intervals to guarantee the recursive clustering won't eventually hit an empty sequence. However, I'd like to at least try to enforce that guarantee at 'level 1'. WIP!