ipyrad icon indicating copy to clipboard operation
ipyrad copied to clipboard

VCF output: depth info corrupted for denovo GBS

Open dereneaton opened this issue 7 years ago • 19 comments

Reported on gitter.im

denovo

locus_3925 142 . G A,T 13 PASS NS=4;DP=33 GT:DP:CATG 0/2:9:1,1,3,4 0/1:7:7,0,0,0 0/2:9:0,1,3,5 ./.:0:0,0,0,0 ./.:0:0,0,0,0 ./.:0:0,0,0,0 ./.:0:0,0,0,0 ./.:0:0,0,0,0 ./.:0:0,0,0,0 ./.:0:0,0,0,0 ./.:0:0,0,0,0 3/1:8:3,4,0,1

I also get malformed CATG fields (missing count): locus_17982 155 . A C 13 PASS NS=12;DP=12794

GT:DP:CATG 1/0:946:154,791,1,0 0/0:1091:141,950,0,0 0/0:1127:152,974,1,0 0/0:1114:138,975,1,0 1/0:1110:761,348,0,1 0/0:841:0,0,1,840 1/0:679:475,203,1,0 1/0:977:664,312,1,0 0/0:1367:0,0,3,1364 0/0:1055:0,0,0,1055 0/0:1264:140,1123,1, 0/0:1223:160,1060,3,

dereneaton avatar Feb 22 '17 16:02 dereneaton

I am also having a similar issue but for SE RAD data. For some reason, I have heterozygous genotypes in my VCF with base counts for only one nucleotide:

0/1:34:0,34,0,0

The proportion of these weird genotypes is rather small on this particular VCF, but I don't know if this site should be considered homozygous or heterozygous.

ODiogoSilva avatar Mar 01 '17 16:03 ODiogoSilva

I can confirm this, on Single end GBS data. I have found a BLAST hit to a mitochondrial sequence, so I was naturally expecting no heterozygous on this locus. However, since there were some heterozygous in there, I went to the VCF file and found some "legitimate" heterozygous (individuals 11 and 13, for example) and some inconsistently called heterozygous genotypes (the last individual, for example):

18107	30	.	C	T	13	PASS	.	GT:DP:CATG	0/0:8:8,0,0,0	0/0:41:41,0,0,0	0/0:105:104,0,0,1	0/0:60:60,0,0,0	1/1:45:0,0,0,45	0/0:9:9,0,0,0	0/0:18:18,0,0,0	0/0:62:62,0,0,0	0/0:39:39,0,0,0	0/0:152:152,0,0,0	1/0:74:27,0,47,0	./.:0:0,0,0,0	1/0:56:32,0,24,0	1/1:15:15,0,0,0	1/0:66:66,0,0,0	1/1:64:1,0,62,1	0/0:82:82,0,0,0	./.:0:0,0,0,0	0/0:91:91,0,0,0	0/0:19:0,0,0,19	0/0:12:12,0,0,0	1/0:96:96,0,0,0	0/0:10:10,0,0,0	0/0:37:37,0,0,0	0/0:35:35,0,0,0	0/0:67:67,0,0,0	1/0:10:10,0,0,0	0/0:14:0,0,0,14	0/0:17:0,0,0,17	1/1:25:25,0,0,0	./.:0:0,0,0,0	./.:0:0,0,0,0	0/0:54:0,54,0,0	1/1:36:0,0,35,1	0/0:58:58,0,0,0	0/0:44:44,0,0,0	0/0:122:122,0,0,0	0/0:55:55,0,0,0	0/0:16:16,0,0,0	0/0:83:83,0,0,0	0/0:83:0,83,0,0	0/0:36:36,0,0,0	0/0:41:41,0,0,0	0/0:29:29,0,0,0	1/0:64:64,0,0,0	0/0:36:36,0,0,0	1/0:94:0,94,0,0	0/0:10:0,0,0,10	0/0:69:69,0,0,0	0/0:79:79,0,0,0	0/0:53:53,0,0,0	1/0:46:46,0,0,0	0/0:45:45,0,0,0	0/0:130:130,0,0,0	0/0:10:10,0,0,0	0/0:16:16,0,0,0	./.:0:0,0,0,0	0/0:11:11,0,0,0	0/0:9:9,0,0,0	0/0:43:43,0,0,0	0/0:19:19,0,0,0	0/0:43:43,0,0,0	0/0:16:16,0,0,0	0/0:36:36,0,0,0	0/0:45:45,0,0,0	0/0:90:0,89,1,0	1/0:119:119,0,0,0	0/0:20:20,0,0,0	1/0:91:44,0,47,0	0/0:42:42,0,0,0	1/0:157:106,0,50,1	./.:0:0,0,0,0	0/0:65:65,0,0,0	./.:0:0,0,0,0	1/0:91:91,0,0,0	./.:0:0,0,0,0	1/0:64:64,0,0,0	0/0:42:42,0,0,0	0/0:70:0,70,0,0	0/0:42:42,0,0,0	1/0:85:85,0,0,0

Look at the last individual, for example: It is reported as an heterozygous genotype (1/0), but base counts show 85,0,0,0. This is unfortunately occurring on approximately 10% of my genotypes - at least this kind of error. I did not spot any reverse error (heterozygous base counts being reported as homozygous), but they might be around.

For reference, here is the corresponding section from the .loci file.

Arg16     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Arg19     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Arg4      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGAT---
Arg5      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Bul11     ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Bul16     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Bul37     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Bul5      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Cat15     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Cat3      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Cat6      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Cat8      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATG--
Cor11     ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATG--
Cor12     ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Cor14     ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGA----
Cor17     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Cor7      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATG--
Haz10     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Haz16     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Haz2      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Haz24     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Haz9      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Ken11     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Ken6      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Ken9      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Lan10     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Lan11     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Lan14     ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Mon12     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Mon13     ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Mon26     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Mon9      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Pug10     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Pug12     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Pug5      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Pug7      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Pug8      -CCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Pug9      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sar15     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGA----
Sar16     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sar23     ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sar4      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sar7      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sar9      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sic13     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sic4      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sic5      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sin21     ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sin33     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sin6      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Taz17     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Taz2      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGAT---
Taz8      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Tol17     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATT---------
Tol27     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Tol3      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tol4      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tol6      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Tos12     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tos15     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tos22     -CCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tos3      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tos7      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Tos8      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGAT---
Tun1      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Tun23     ACCAAATGTATAGGCCGCAGGAAACATGGYGCRGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTA--------
Tun6      ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tun9      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var16     ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var17     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var19     -CCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var21     ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var4      ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
//                                     *  -                                             |18106|

The .loci file also shows the last individual as an heterozygote, which is consisten with the vcf genotype. This means that is the error is in the genotype calling and not the base counts, it's coming downstream of this (Im not sure how to post the hdf5 file here).

This was done on ipyrad v0.5.15.

StuntsPT avatar Mar 01 '17 16:03 StuntsPT

I just re-ran using ipyrad 0.6.8 and the issue remains...

StuntsPT avatar Mar 02 '17 13:03 StuntsPT

I have not been able to reproduce this for either SE RAD or PE ddRAD on real data. @StuntsPT Can you dropbox me the hdf5 file? Or better a chunk of the raw data and the params file you used.

isaacovercast avatar Mar 04 '17 23:03 isaacovercast

@isaacovercast I sent you a link to the hdf5 file via gitter private chat. This is SE GBS data.

StuntsPT avatar Mar 04 '17 23:03 StuntsPT

tldr;: Yes the genotype counts are wrong in the vcf file, but the genotypes themselves are correct. Heterozygous sites ARE correct even if the counts are wrong.

Got good news and bad news. The good news is the consens.hdf5 file is intact and looking good. The other good news is this means that all vcf files we've been creating up to now are still ok, in terms of the actual genotypes. The bad news is the output of make_vcf must be fscking something and I'm not super familiar with that code so it might take some time to fix. Still not sure exactly what's happening or why his is doing this, but here's the output for this snp for two individuals, from the vcf output above the "-11" individual has the good catg info and the "16" individual has the homozygous looking catg:

>>> io5["catgs"][37200][-11][34]
array([106,   0,  50,   1], dtype=uint32)
>>> io5["catgs"][37200][16][34]
array([47,  0, 19,  0], dtype=uint32)
>>> io5["seqs"][37200][16][34]
'Y'
>>> io5["seqs"][37200][-11][34]
'Y'

isaacovercast avatar Mar 05 '17 00:03 isaacovercast

The "bad" individual has the incorrect genotype counts (" [66, 0, 0, 0]") in the neighborhood of the bad base:

>>> io5["catgs"][37200][16][30:40]
array([[ 0, 66,  0,  0],
       [ 0,  0, 66,  0],
       [ 0,  0,  0, 66],
       [ 0,  0,  0, 66],
       [47,  0, 19,  0],
       [ 0,  0,  1, 65],
       [66,  0,  0,  0],
       [ 0,  0,  0, 66],
       [ 0,  0,  0, 66],
       [ 0,  0, 66,  0]], dtype=uint32)

A different individual with the bad base has the hetero genotype counts farther away.

>>> catg[82][34:49]                                                                                                                                
array([[83,  0,  0,  0],
       [83,  0,  0,  0],
       [83,  0,  0,  0],
       [ 0,  0, 83,  0],
       [ 0,  0,  0, 83],
       [84,  0,  0,  0],
       [84,  0,  0,  0],
       [ 0,  0, 84,  0],
       [ 0,  0,  0, 84],
       [ 0, 84,  0,  0],
       [84,  0,  0,  0],
       [84,  0,  0,  0],
       [ 0,  0,  0, 84],
       [84,  0,  0,  0],
       [ 0, 27,  0, 57]], dtype=uint32)

Another sample with a bad base (the hetero genotype is base 34, recall:

>>> catg[23][34:49]
array([[96,  0,  0,  0],
       [97,  0,  0,  0],
       [97,  0,  0,  0],
       [ 1,  0, 96,  0],
       [ 1,  0,  0, 96],
       [97,  0,  0,  0],
       [97,  0,  0,  0],
       [ 0,  0, 97,  0],
       [ 0,  0,  0, 97],
       [ 0, 97,  1,  0],
       [99,  0,  0,  0],
       [99,  0,  0,  0],
       [ 0,  0,  0, 99],
       [99,  0,  0,  0],
       [ 0, 43,  0, 56]], dtype=uint32)

isaacovercast avatar Mar 05 '17 00:03 isaacovercast

~The .loci writer may be acting up too, since the issue is also revealed there...~ Sleep deprivation led to this. Forget it as it made no sense.

StuntsPT avatar Mar 05 '17 00:03 StuntsPT

I haven't had a chance to look at this yet, but one idea is that the problem may be related to indels, which are imputed into the catg array after step6 aligning. If an indel is not properly imputed then the base counts would not line up properly with the base calls.

eaton-lab avatar Mar 05 '17 01:03 eaton-lab

Also NB: the actual cluster number for this locus inside the clust.hdf5 file is 37200, and not 18106 as reported in the .loci file, this took me a while to figure out. 32700 is a conspicuously round number, perhaps an optim edge case? Though my instinct was something about indels as well.

isaacovercast avatar Mar 05 '17 01:03 isaacovercast

That's great news. It means I can carry on with my analyses! The read counts are not vital, I was just worried that the genotypes could be incorrectly called.

StuntsPT avatar Mar 05 '17 22:03 StuntsPT

Just for the record, I have provided via private gitter.im message the required sample files to reproduce the issue.

StuntsPT avatar Mar 05 '17 22:03 StuntsPT

omg how is this still an issue? I had hoped the update to v0.9 had fixed this, but apparently not.

This was reported on gitter and I got the user to share data necessary to reproduce:

RAD_0 6 loc0_pos5 C T 13 PASS NS=114;DP=3165 GT:DP:CATG 0/0:20:20,0,0,0 0/0:44:0,0,44,0 0/0:16:16,0,0,0 0/0:25:0,0,25,0 0/0:21:0,0,21,0 0/0:15:15,0,0,0 0/0:8:0,0,8,0

isaacovercast avatar Mar 15 '21 11:03 isaacovercast

There appears to be no rhyme or reason to the depth issue. It's not a sample issue because I looked at 2 different samples and in both of them sometimes the catgs line up with the sequence and sometimes they don't. It's not locus specific because within any given locus I observe samples with both good and bad catg data. It appears unrelated to indels, because I can see sometimes a sample at a locus with tons of indels still has correct catg info, and sometimes a sample at a locus with complete sequence (except 2 indels at the end) has bad catg info.

isaacovercast avatar Mar 15 '21 12:03 isaacovercast

Simulated data appears unaffected (naturally).

Pedicularis data also seems unaffected. I checked about 10 random sequences for three different samples and they were all fine.

isaacovercast avatar Mar 15 '21 12:03 isaacovercast

So, in consens_se.py the catgs are calculated from self.arrayed on line 709. The sequence that gets written out comes from self.consens. self.arrayed is created inside build_consens_and_array and then self.consens is generated from arrayed by base_caller, then a bunch of other stuff happens, and somewhere in there arrayed and consens diverge. Here are all the places arrayed gets manipulated:

Line 611:

            self.consens = self.consens[ltrim:rtrim + 1]
            self.arrayed = self.arrayed[:, ltrim:rtrim + 1]

Line 637 (inside mask_repeats):

        # apply filter
        self.consens = self.consens[keep]
        self.arrayed = self.arrayed[:, keep]

This all looks legit.

storealleles manipulates consens, but I can see that this problem crops up both for sequences with 1 and with 2 alleles, so it can't be this.

isaacovercast avatar Mar 16 '21 19:03 isaacovercast

FUCK! I cracked it. It's GBS.

Here is a bad base:

RAD_9   8       loc9_pos7       G       T       13      PASS    NS=2;DP=26      GT:DP:CATG      1/1:9:0,0,9,0   0/0:17:0,0,17,0 ./.:0:0,0,0,0   ./.:0:0,0,0,0

Here is the locus in the clust_database.fa:

>S197batch2_10079
TAATATTTCTTATAGTTGTGTCAGATTAGAACAATCTGCTTGTTGTTTGGTGTAGTTATGACTTCAGAAACATAGAAGAAGACAGGCAGTTGCATTTGTTTTGTCAGCAATTGTCTGGCAGTTTCTAGTTGGACAGAGACTGATnnnnGCAGTTTCTAGTTGGACAGAGACTGATTAGAAGAATCTGCTTGTTGTTTGGTATAGTTATAACTTCAGAAACTAAGAAGAAGATAGGCAGTTGCATTTGTTTTGTCAGCCATTTTCTGGCAGTTTGTAGCTGGACGGAGACTTA
>S226zAbatch2_6418
TAATATTTCTGATAGTTGTGTCAGATTAGAACAATCTGCTTGTTGTTTGGCGTAGTTATGACTTCAGAAACATAGAAGAAGACAGGCAGTTGCATTTGTTTTGTCAGCAATTGTCTGGCAGTTTCTAGTTGGACAGAGACAGATnnnnGCAGTTTCTAGTTGGACAGAGACAGATTAGAAGAATCTGCTTGTTGTTTGGTATAGTTATAACTTCAGAAACTAAGAAGAAGATAGGCAGTTGCATTTGTTTTGTCAGCCATTTTCTGGCAGTTTGTAGCTGGACGGAGACTTA

Here are the first 17 bases from the catg for S197batch2 (matches the proximal end of the sequence in the clust db):

[[0 0 9 0]
 [0 9 0 0]
 [0 9 0 0]
 [0 0 9 0]
 [0 9 0 0]
 [0 0 9 0]
 [0 0 9 0]
 [0 0 9 0]
 [9 0 0 0]
 [0 0 8 1]
 [0 0 9 0]
 [0 9 0 0]
 [0 0 9 0]
 [0 9 0 0]
 [0 0 0 9]
 [0 0 9 0]
 [0 0 9 0]]

Here are the first 17 bases from the catg for S226zAbatch2 (revcomp matches the last 17 bases of the distal end of the sequence in the clust db):

[[ 0  0 17  0]
 [ 0 17  0  0]
 [ 0 17  0  0]
 [ 0  0  0 17]
 [ 0  0 17  0]
 [17  0  0  0]
 [ 0  0 17  0]
 [17  0  0  0]
 [17  0  0  0]
 [ 0  0  0 17]
 [ 0  0 17  0]
 [17  0  0  0]
 [17  0  0  0]
 [ 0 17  0  0]
 [ 0  0  0 17]
 [17  0  0  0]
 [ 0  0 17  0]]

Still don't know how to fix it, but damn, i'm super happy to have figured it out and it seems like it might be an easy fix.....

isaacovercast avatar Mar 16 '21 20:03 isaacovercast

SHIIIIIIIIITTTTTTTTTTTTTTTTTTTTTTTTT! It's not easy to fix. In clustmap_across.py GBS data gets clustered on both strands, so sequences can 'flip' to match the strand of the seed, which obviously flips the orientation of the sequence and breaks the connection with the catg info constructed during step 5. Damn.

Essentially the depth info is getting fsck for GBS data because the catg depths are recorded during step 5 and then during clustering in step 6 the consensus sequences can get randomly re-oriented to match the strand of the seed, which breaks the connection with the orientation of the depths in the catg file.

isaacovercast avatar Mar 16 '21 20:03 isaacovercast

The fix for this will be to parse the utemp file during step 6 and step through each catg file for each sample and reverse the depth info for hits that match on the negative strand. This only needs to be done for GBS data, and probably should happen at the end of step 6. This will be super annoying, and a lot of work, so I will file this under "will fix when someone pays me to do it." ;p

isaacovercast avatar Mar 17 '21 11:03 isaacovercast