delly
Only 3 chromosomes in filter germline CNVs
Hi,
I used delly to regenotype hundreds of samples.
Merge genotypes using bcftools:
bcftools merge -m id -O b -o merged.bcf geno1.bcf ... genoN.bcf
The output of this step, merged.bcf, contains all 23 chromosomes.
However, after the next step, "Filter for germline CNVs":
delly classify -f germline -o filtered.bcf merged.bcf
only 3 chromosomes remain, as shown below:
docker run -it --rm \
-v ~/delly-0.8.7:/tmp/sv \
blcdsdockerregistry/bcftools:1.11 view /tmp/sv/filtered_germline_gCNV.bcf \
> ~/gCNV_debug/filtered_germline_gCNV.csv
grep chr filtered_germline_gCNV.csv | grep -v "#" | awk -F'\t' '{print $1}' | sort | uniq
chr1
chr2
chr3
Any thoughts why this happened? Thanks.
I suppose the merged.bcf file contains SV calls on all chromosomes? I would guess the delly classify step crashed. Do you have the log files?
Hi Tobias,
The delly classify step didn't throw any errors or warnings. When I classify 10 samples, I see all the chromosomes, as below:
docker run -u $(id -u):$(id -g) --memory 130g -it --rm -v ${out_dir}:/out blcdsdockerregistry/bcftools:1.12 \
view /out/debug_${i}.gcnv.bcf | grep -v "#" | grep PASS | \
cut -f 1 | uniq -c > ${out_dir}/debug_${i}.uniq.chrs.txt
cat debug_10_samples.uniq.chrs.txt
114 chr1
84 chr2
49 chr3
53 chr4
39 chr5
39 chr6
40 chr7
23 chr8
33 chr9
31 chr10
35 chr11
26 chr12
19 chr13
15 chr14
20 chr15
25 chr16
30 chr17
8 chr18
39 chr19
13 chr20
66 chr21
9 chr22
51 chrX
When I increased this to 84 samples, only 3 chromosomes were left in the output.
cat debug_84_samples.uniq.chrs.txt
158 chr1
9 chr2
4 chr3
I also tried a different set of samples. At first, after merging and classifying two samples, chr13 was missing. As I added more samples, the number of missing chromosomes increased, as below:
debug_2_samples.uniq.chrs.txt missing chr 13
debug_3_samples.uniq.chrs.txt missing chr 13
debug_4_samples.uniq.chrs.txt missing chr 12 13 14 15 16 18 20 22 X
debug_5_samples.uniq.chrs.txt missing chr 12 13 14 15 16 18 19 20 22 X
debug_6_samples.uniq.chrs.txt missing chr 12 13 14 15 16 18 19 20 22 X
debug_7_samples.uniq.chrs.txt missing chr 12 14 15 16 18 20 22 X
debug_8_samples.uniq.chrs.txt missing chr 12 14 15 16 18 20 22 X
debug_9_samples.uniq.chrs.txt missing chr 12 14 15 16 18 20 22 X
debug_10_samples.uniq.chrs.txt missing chr 8 12 14 15 16 18 20 22 X
debug_20_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_30_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_40_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_50_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_60_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_70_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_80_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_90_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_100_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
Maybe this issue is related to specific samples? Any thoughts on how to confirm this? Thanks.
Hi Tobias,
I'm working on the same CNV dataset. We're using the latest Delly 0.8.7 and bcftools 1.11 (we also tried 1.12, which made no apparent difference).
We have done some benchmarking and it appears that delly classify removes CNV variants when at least one sample has no data for GT:CN:CNL:GQ:FT:RDCN:RDSD (i.e. ./.:.:.:.:.:.:.).
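One way to check this observation on a merged file is to count, per chromosome, how many records have a missing genotype in at least one sample. The following is a minimal sketch, not something we have run on the real data: the function name count_missing_gt and the demo records are made up for illustration, and it assumes missing genotypes appear as ./. or . in the GT field.

```shell
# Sketch: per chromosome, count records where at least one sample has a
# missing genotype. Input is tab-separated: CHROM, then one GT per sample,
# e.g. as produced by:
#   bcftools query -f '%CHROM[\t%GT]\n' merged.bcf
# Output columns: CHROM, records with a missing GT, total records.
count_missing_gt() {
  awk -F'\t' '
    {
      # remember chromosomes in order of first appearance
      if (total[$1]++ == 0) order[++n] = $1
      for (i = 2; i <= NF; i++) {
        if ($i == "./." || $i == ".") { missing[$1]++; break }
      }
    }
    END {
      for (k = 1; k <= n; k++) {
        c = order[k]
        printf "%s\t%d\t%d\n", c, missing[c] + 0, total[c]
      }
    }'
}

# Demo on made-up records with two samples each (hypothetical data)
printf 'chr1\t0/1\t0/0\nchr1\t./.\t0/1\nchr13\t0/1\t./.\n' | count_missing_gt
```

If the chromosomes that disappear after delly classify are the ones where every record has a missing genotype in some sample, that would support the explanation above.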
This may happen when variants share the same genomic position but have different CNV IDs.
For example, in the merged BCF after genotyping (bcftools merge), I see entries like
chr13 22354889 CNV00018012
chr13 22354889 CNV00018010
and GT:CN:CNL:GQ:FT:RDCN:RDSD is empty for at least one sample. Is it normal to see entries like this?
As we merged more samples, more empty entries appear to have been generated, and these were filtered out during classification, as Pan showed above.
We would appreciate your response.
EDIT: bcftools sort might solve the issue and we're currently testing. We'll keep you posted.
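The same-position, different-ID entries described in this post can be listed with a small awk filter. This is only a sketch: the function name dup_id_positions is made up, and the chr1 record in the demo is invented to show a position that is not reported.

```shell
# Sketch: list positions that occur with more than one distinct ID.
# Input is tab-separated CHROM, POS, ID, e.g. as produced by:
#   bcftools query -f '%CHROM\t%POS\t%ID\n' merged.bcf
# Output columns: CHROM, POS, comma-separated IDs seen at that position.
dup_id_positions() {
  awk -F'\t' '
    {
      key = $1 "\t" $2
      if (!((key, $3) in pair)) {      # count each ID once per position
        pair[key, $3] = 1
        if (count[key]++ == 0) { order[++n] = key; ids[key] = $3 }
        else ids[key] = ids[key] "," $3
      }
    }
    END {
      for (k = 1; k <= n; k++)
        if (count[order[k]] > 1) print order[k] "\t" ids[order[k]]
    }'
}

# Demo: the two chr13 entries quoted above plus one made-up chr1 record
printf 'chr13\t22354889\tCNV00018012\nchr13\t22354889\tCNV00018010\nchr1\t100\tCNV00000001\n' | dup_id_positions
```

Only the chr13 position is printed, together with both of its IDs.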
Hi,
We looked further into the issue, and it seems that dropping the -m id option solved it.
The bcftools merge -m id option (i.e. merge by ID, the 3rd column) probably caused the issue: after genotyping, each position had a slightly different ID in each sample (is this expected?), so merging by ID did not merge the positions properly across samples. This produced empty FORMAT fields, and records with empty FORMAT fields for at least one sample were removed by the filtering step.
Sample A
chr13 18177772 CNV00017968
Sample B
chr13 18177772 CNV00017970
Merge without -m id
chr13 18177772 CNV00017968;CNV00017970
Merge with -m id -> empty FORMAT (i.e. ./.:.:.:.:.:.:.) for sample A or B
chr13 18177772 CNV00017970
chr13 18177772 CNV00017968
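The difference between the two groupings in the example above can be reproduced with a toy model. This is only an illustration and not bcftools' actual merge logic: the function name group_records and the "id"/"pos" mode names are made up, and grouping by CHROM:POS is a simplification of what bcftools merge does without -m id.

```shell
# Toy model of how records are grouped across samples; NOT bcftools itself.
# Input is tab-separated SAMPLE, CHROM, POS, ID.
# Argument: "id" groups by the ID column, "pos" groups by CHROM:POS.
# Output columns: grouping key, number of samples contributing data.
group_records() {
  awk -F'\t' -v mode="$1" '
    {
      key = (mode == "id") ? $4 : ($2 ":" $3)
      if (!((key, $1) in seen)) {      # count each sample once per key
        seen[key, $1] = 1
        if (nsamp[key]++ == 0) order[++n] = key
      }
    }
    END {
      for (k = 1; k <= n; k++) print order[k] "\t" nsamp[order[k]]
    }'
}

# The two records from the example above (samples A and B)
records=$(printf 'A\tchr13\t18177772\tCNV00017968\nB\tchr13\t18177772\tCNV00017970\n')

echo "grouped by ID:"
printf '%s\n' "$records" | group_records id
echo "grouped by position:"
printf '%s\n' "$records" | group_records pos
```

Grouping by ID yields two records that each have data from only one of the two samples, so the other sample's FORMAT fields would be empty. Grouping by position yields a single record with data from both samples.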
Do you happen to know if there are any issues with dropping the -m id option (i.e. merging by position (and score?))? https://github.com/dellytools/delly/blob/master/README.md#germline-cnv-calling