bcftools
bcftools copied to clipboard
gtcheck reports wrong SAMPLE ID and --plot further modifies that
Hi,
This issue raises 2 issues related to the SAMPLE ID
column in the output generated by gtcheck
SETUP
I have a query vcf with 3 samples, and a BCF of pooled genotypes with 41 samples. I want to identify the samples in the target that match those in my query.
This is the command I ran
bcftools gtcheck -s CLUST0 -g pooled_genotypes.bcf --plot foo SAMPLE21.vcf.gz
Some additional info:
[arrao@n34 SNP_calling]$ bcftools --version
bcftools 1.10.2
Using htslib 1.10.2
Copyright (C) 2019 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
[arrao@n34 SNP_calling]$ gcc --version
gcc (GCC) 5.5.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ISSUE 1: The order of SAMPLE ID in the output is not the same as the order in the input file.
Here's the contents of foo.tab
. The number after SAMPLE in the Sample column is the order in which the samples appear in pooled_genotypes.bcf
so you can clearly see that the SAMPLE ID is not concordant with the order of appearance in the input file.
# This file was produced by bcftools (1.10.2+htslib-1.10.2), the command line was:
# bcftools gtcheck -s CLUST0 -g pooled_genotypes.bcf --plot foo SAMPLE21.vcf.gz
# and the working directory was:
# /krummellab/data1/arrao/projects/MVIR1/SNP_calling
#
# [1]CN [2]Discordance with CLUST0 (total) [3]Discordance (avg score per site) [4]Number of sites compared [5]Sample [6]Sample ID
CN 1.354197e+05 1.164304e+05 56523 SAMPLE25 0
CN 1.328193e+05 1.150803e+05 56088 SAMPLE10 1
CN 1.324549e+05 1.145297e+05 56203 SAMPLE35 2
CN 1.322396e+05 1.131039e+05 56819 SAMPLE37 3
CN 1.312854e+05 1.135609e+05 56182 SAMPLE40 4
CN 1.311655e+05 1.132718e+05 56274 SAMPLE41 5
CN 1.310871e+05 1.159929e+05 54921 SAMPLE9 6
CN 1.294849e+05 1.126935e+05 55838 SAMPLE8 7
CN 1.293130e+05 1.115253e+05 56348 SAMPLE27 8
CN 1.290580e+05 1.170206e+05 53596 SAMPLE33 9
CN 1.288820e+05 1.148341e+05 54542 SAMPLE36 10
CN 1.279940e+05 1.135764e+05 54766 SAMPLE6 11
CN 1.267735e+05 1.109098e+05 55548 SAMPLE29 12
CN 1.264362e+05 1.157098e+05 53102 SAMPLE31 13
CN 1.247447e+05 1.174030e+05 51636 SAMPLE12 14
CN 1.246350e+05 1.179485e+05 51352 SAMPLE7 15
CN 1.245969e+05 1.193723e+05 50724 SAMPLE23 16
CN 1.243550e+05 1.181021e+05 51170 SAMPLE15 17
CN 1.243322e+05 1.086077e+05 55633 SAMPLE18 18
CN 1.234790e+05 1.033448e+05 58065 SAMPLE34 19
CN 1.227538e+05 1.057801e+05 56395 SAMPLE38 20
CN 1.226531e+05 1.178329e+05 50585 SAMPLE4 21
CN 1.223182e+05 1.084589e+05 54807 SAMPLE1 22
CN 1.210414e+05 1.072289e+05 54857 SAMPLE39 23
CN 1.203345e+05 1.036972e+05 56394 SAMPLE11 24
CN 1.201919e+05 1.109227e+05 52658 SAMPLE30 25
CN 1.198450e+05 1.068389e+05 54513 SAMPLE26 26
CN 1.195165e+05 1.017118e+05 57104 SAMPLE32 27
CN 1.192589e+05 1.180084e+05 49112 SAMPLE16 28
CN 1.186695e+05 1.066657e+05 54066 SAMPLE24 29
CN 1.181527e+05 1.095567e+05 52410 SAMPLE5 30
CN 1.159676e+05 1.116109e+05 50494 SAMPLE14 31
CN 1.087883e+05 1.092854e+05 48376 SAMPLE22 32
CN 1.066748e+05 1.182527e+05 43839 SAMPLE20 33
CN 1.065064e+05 1.295950e+05 39939 SAMPLE2 34
CN 1.063215e+05 1.307483e+05 39518 SAMPLE19 35
CN 1.037806e+05 1.287279e+05 39179 SAMPLE3 36
CN 9.863743e+04 1.252708e+05 38265 SAMPLE17 37
CN 9.287959e+04 1.354197e+05 33331 SAMPLE13 38
CN 9.062675e+04 1.237411e+05 35592 SAMPLE28 39
CN 3.322398e+04 6.753899e+04 23906 SAMPLE21 40
ISSUE 2: the python script for plotting sorts the data affecting the SAMPLE ID
yet again.
Running python foo.py
generates this plot
The x axis is called SAMPLE ID
and it ranges from 0-40 but it is very obviously the reverse of SAMPLE ID
in foo.tab. In fact, it isn't really even related to SAMPLE ID
in the tab file since the csv reader doesn't even consider the column with SAMPLE ID
and just sorts the data based on Discordance with CLUST0
.
Line 20 of foo.py
dat = sorted(dat)
can be replaced with
dat = sorted(dat, reverse=True)
to ensure that SAMPLE ID
in the tab and the png are the same since it looks like the tab is sorted by decreasing Discordance with CLUST0
and the value of SAMPLE ID
goes from 0 to num_samples-1
and the default behaviour of sorted
is to sort in an increasing order. This won't cover edge cases obviously and it's probably best to just use Sample
as the x axis instead.
Here's the contents of foo.py generated by the command
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import csv
csv.register_dialect('tab', delimiter='\t', quoting=csv.QUOTE_NONE)
sample_ids = False
dat = []
with open('foo.tab', 'r') as f:
reader = csv.reader(f, 'tab')
for row in reader:
if row[0][0]=='#': continue
if row[0]!='CN': continue
tgt = 0
if row[4]=='': tgt = 1
dat.append([float(row[1]), float(row[2]), float(row[3]), tgt, row[4]])
dat = sorted(dat)
iq = -1; dp = 0
for i in range(len(dat)):
if iq==-1 and dat[i][3]==1: iq = i
dp += dat[i][2]
dp /= len(dat)
fig,ax1 = plt.subplots(figsize=(8,5))
ax2 = ax1.twinx()
plots = ax1.plot([x[0] for x in dat],'o-', ms=3, color='g', mec='g', label='Discordance (total)')
plots += ax1.plot([x[1] for x in dat], '^', ms=3, color='r', mec='r', label='Discordance (avg per site)')
plots += ax2.plot([x[2] for x in dat],'v', ms=3, color='k', label='Number of sites')
if iq!=-1:
ax1.plot([iq],[dat[iq][0]],'o',color='orange', ms=9)
ax1.annotate('',xy=(iq,dat[iq][0]), xytext=(5,5), textcoords='offset points',fontsize='xx-small',rotation=45,va='bottom',ha='left')
ax1.plot([iq],[dat[iq][1]],'^',color='red', ms=5)
for tl in ax1.get_yticklabels(): tl.set_color('g')
for tl in ax2.get_yticklabels(): tl.set_color('k'); tl.set_fontsize(9)
min_dp = min([x[2] for x in dat])
max_dp = max([x[2] for x in dat])
ax2.set_ylim(min_dp-1,max_dp+1)
ax1.set_title('Discordance with CLUST0')
ax1.set_xlim(-0.05*len(dat),1.05*(len(dat)-1))
ax1.set_xlabel('Sample ID')
plt.subplots_adjust(left=0.1,right=0.9,bottom=0.1,top=0.9)
if sample_ids:
ax1.set_xticks(range(len(dat)))
ax1.set_xticklabels([x[4] for x in dat],**{'rotation':45, 'ha':'right', 'fontsize':8})
plt.subplots_adjust(bottom=0.2)
ax1.set_ylabel('Discordance',color='g')
ax2.set_ylabel('Number of sites',color='k')
ax2.ticklabel_format(style='sci', scilimits=(-3,2), axis='y')
ax1.ticklabel_format(style='sci', scilimits=(-3,2), axis='y')
labels = [l.get_label() for l in plots]
plt.legend(plots,labels,numpoints=1,markerscale=1,loc='best',prop={'size':10},frameon=False)
plt.savefig('foo.png')
plt.close()
Thank you for this issue. As far as I remember, the sample IDs did not try to match the order of samples in the input file, but it is still possible that it is a bug. Coincidentally, gtcheck
is undergoing a major revamp at the moment. Could you please try with the latest version that is to be merged into the develop branch soon?
git clone git://github.com/samtools/htslib.git
git clone --branch gtcheck https://github.com/samtools/bcftools.git
cd bcftools
autoheader && autoconf && ./configure
make
Note that the plotting functionality has not been carried over yet and the usage and the output is a little bit different, simpler I hope.
Sure can do. I assume this needs htslib1.10.2 ?
The commands to download and compile are given above, including the download of the latest htslib. See here http://samtools.github.io/bcftools/howtos/install.html for more