bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

gtcheck reports wrong SAMPLE ID and --plot further modifies that

Open arkal opened this issue 4 years ago • 3 comments

Hi,

This issue raises 2 issues related to the SAMPLE ID column in the output generated by gtcheck SETUP I have a query vcf with 3 samples, and a BCF of pooled genotypes with 41 samples. I want to identify the samples in the target that match those in my query.

This is the command I ran

bcftools gtcheck  -s CLUST0 -g pooled_genotypes.bcf --plot foo SAMPLE21.vcf.gz

Some additional info:

[arrao@n34 SNP_calling]$ bcftools --version
bcftools 1.10.2
Using htslib 1.10.2
Copyright (C) 2019 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
[arrao@n34 SNP_calling]$ gcc --version
gcc (GCC) 5.5.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ISSUE 1: The order of SAMPLE ID in the output is not the same as the order in the input file.

Here's the contents of foo.tab. The number after SAMPLE in the Sample column is the order in which the samples appear in pooled_genotypes.bcf so you can clearly see that the SAMPLE ID is not concordant with the order of appearance in the input file.

# This file was produced by bcftools (1.10.2+htslib-1.10.2), the command line was:
# 	 bcftools gtcheck  -s CLUST0 -g pooled_genotypes.bcf --plot foo SAMPLE21.vcf.gz
# and the working directory was:
# 	 /krummellab/data1/arrao/projects/MVIR1/SNP_calling
#
# [1]CN	[2]Discordance with CLUST0 (total)	[3]Discordance (avg score per site)	[4]Number of sites compared	[5]Sample	[6]Sample ID
CN	1.354197e+05	1.164304e+05	56523	SAMPLE25	0
CN	1.328193e+05	1.150803e+05	56088	SAMPLE10	1
CN	1.324549e+05	1.145297e+05	56203	SAMPLE35	2
CN	1.322396e+05	1.131039e+05	56819	SAMPLE37	3
CN	1.312854e+05	1.135609e+05	56182	SAMPLE40	4
CN	1.311655e+05	1.132718e+05	56274	SAMPLE41	5
CN	1.310871e+05	1.159929e+05	54921	SAMPLE9	6
CN	1.294849e+05	1.126935e+05	55838	SAMPLE8	7
CN	1.293130e+05	1.115253e+05	56348	SAMPLE27	8
CN	1.290580e+05	1.170206e+05	53596	SAMPLE33	9
CN	1.288820e+05	1.148341e+05	54542	SAMPLE36	10
CN	1.279940e+05	1.135764e+05	54766	SAMPLE6	11
CN	1.267735e+05	1.109098e+05	55548	SAMPLE29	12
CN	1.264362e+05	1.157098e+05	53102	SAMPLE31	13
CN	1.247447e+05	1.174030e+05	51636	SAMPLE12	14
CN	1.246350e+05	1.179485e+05	51352	SAMPLE7	15
CN	1.245969e+05	1.193723e+05	50724	SAMPLE23	16
CN	1.243550e+05	1.181021e+05	51170	SAMPLE15	17
CN	1.243322e+05	1.086077e+05	55633	SAMPLE18	18
CN	1.234790e+05	1.033448e+05	58065	SAMPLE34	19
CN	1.227538e+05	1.057801e+05	56395	SAMPLE38	20
CN	1.226531e+05	1.178329e+05	50585	SAMPLE4	21
CN	1.223182e+05	1.084589e+05	54807	SAMPLE1	22
CN	1.210414e+05	1.072289e+05	54857	SAMPLE39	23
CN	1.203345e+05	1.036972e+05	56394	SAMPLE11	24
CN	1.201919e+05	1.109227e+05	52658	SAMPLE30	25
CN	1.198450e+05	1.068389e+05	54513	SAMPLE26	26
CN	1.195165e+05	1.017118e+05	57104	SAMPLE32	27
CN	1.192589e+05	1.180084e+05	49112	SAMPLE16	28
CN	1.186695e+05	1.066657e+05	54066	SAMPLE24	29
CN	1.181527e+05	1.095567e+05	52410	SAMPLE5	30
CN	1.159676e+05	1.116109e+05	50494	SAMPLE14	31
CN	1.087883e+05	1.092854e+05	48376	SAMPLE22	32
CN	1.066748e+05	1.182527e+05	43839	SAMPLE20	33
CN	1.065064e+05	1.295950e+05	39939	SAMPLE2	34
CN	1.063215e+05	1.307483e+05	39518	SAMPLE19	35
CN	1.037806e+05	1.287279e+05	39179	SAMPLE3	36
CN	9.863743e+04	1.252708e+05	38265	SAMPLE17	37
CN	9.287959e+04	1.354197e+05	33331	SAMPLE13	38
CN	9.062675e+04	1.237411e+05	35592	SAMPLE28	39
CN	3.322398e+04	6.753899e+04	23906	SAMPLE21	40

ISSUE 2: the python script for plotting sorts the data affecting the SAMPLE ID yet again. Running python foo.py generates this plot foo

The x axis is called SAMPLE ID and it ranges from 0-40 but it is very obviously the reverse of SAMPLE ID in foo.tab. In fact, it isn't really even related to SAMPLE ID in the tab file since the csv reader doesn't even consider the column with SAMPLE ID and just sorts the data based on Discordance with CLUST0.

Line 20 of foo.py

dat = sorted(dat)

can be replaced with

dat = sorted(dat, reverse=True)

to ensure that SAMPLE ID in the tab and the png are the same since it looks like the tab is sorted by decreasing Discordance with CLUST0 and the value of SAMPLE ID goes from 0 to num_samples-1 and the default behaviour of sorted is to sort in an increasing order. This won't cover edge cases obviously and it's probably best to just use Sample as the x axis instead.

Here's the contents of foo.py generated by the command

import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import csv
csv.register_dialect('tab', delimiter='\t', quoting=csv.QUOTE_NONE)

sample_ids = False

dat = []
with open('foo.tab', 'r') as f:
    reader = csv.reader(f, 'tab')
    for row in reader:
        if row[0][0]=='#': continue
        if row[0]!='CN': continue
        tgt = 0
        if row[4]=='': tgt = 1
        dat.append([float(row[1]), float(row[2]), float(row[3]), tgt, row[4]])

dat = sorted(dat)

iq = -1; dp = 0
for i in range(len(dat)):
    if iq==-1 and dat[i][3]==1: iq = i
    dp += dat[i][2]
dp /= len(dat)

fig,ax1 = plt.subplots(figsize=(8,5))
ax2 = ax1.twinx()
plots  = ax1.plot([x[0] for x in dat],'o-', ms=3, color='g', mec='g', label='Discordance (total)')
plots += ax1.plot([x[1] for x in dat], '^', ms=3, color='r', mec='r', label='Discordance (avg per site)')
plots += ax2.plot([x[2] for x in dat],'v', ms=3, color='k', label='Number of sites')
if iq!=-1:
   ax1.plot([iq],[dat[iq][0]],'o',color='orange', ms=9)
   ax1.annotate('',xy=(iq,dat[iq][0]), xytext=(5,5), textcoords='offset points',fontsize='xx-small',rotation=45,va='bottom',ha='left')
   ax1.plot([iq],[dat[iq][1]],'^',color='red', ms=5)
for tl in ax1.get_yticklabels(): tl.set_color('g')
for tl in ax2.get_yticklabels(): tl.set_color('k'); tl.set_fontsize(9)
min_dp = min([x[2] for x in dat])
max_dp = max([x[2] for x in dat])
ax2.set_ylim(min_dp-1,max_dp+1)
ax1.set_title('Discordance with CLUST0')
ax1.set_xlim(-0.05*len(dat),1.05*(len(dat)-1))
ax1.set_xlabel('Sample ID')
plt.subplots_adjust(left=0.1,right=0.9,bottom=0.1,top=0.9)
if sample_ids:
   ax1.set_xticks(range(len(dat)))
   ax1.set_xticklabels([x[4] for x in dat],**{'rotation':45, 'ha':'right', 'fontsize':8})
   plt.subplots_adjust(bottom=0.2)
ax1.set_ylabel('Discordance',color='g')
ax2.set_ylabel('Number of sites',color='k')
ax2.ticklabel_format(style='sci', scilimits=(-3,2), axis='y')
ax1.ticklabel_format(style='sci', scilimits=(-3,2), axis='y')
labels = [l.get_label() for l in plots]
plt.legend(plots,labels,numpoints=1,markerscale=1,loc='best',prop={'size':10},frameon=False)
plt.savefig('foo.png')
plt.close()

arkal avatar Jun 05 '20 01:06 arkal

Thank you for this issue. As far as I remember, the sample IDs did not try to match the order of samples in the input file, but it is still possible that it is a bug. Coincidentally, gtcheck is undergoing a major revamp at the moment. Could you please try with the latest version that is to be merged into the develop branch soon?

git clone git://github.com/samtools/htslib.git
git clone --branch gtcheck https://github.com/samtools/bcftools.git
cd bcftools
autoheader && autoconf && ./configure
make

Note that the plotting functionality has not been carried over yet and the usage and the output is a little bit different, simpler I hope.

pd3 avatar Jun 05 '20 08:06 pd3

Sure can do. I assume this needs htslib1.10.2 ?

arkal avatar Jun 05 '20 18:06 arkal

The commands to download and compile are given above, including the download of the latest htslib. See here http://samtools.github.io/bcftools/howtos/install.html for more

pd3 avatar Jun 08 '20 06:06 pd3