Only 3 chromosomes in filter germline CNVs

Open yupan-ucla opened this issue 3 years ago • 4 comments

Hi,

I used delly to regenotype hundreds of samples.

Merge genotypes using bcftools:

bcftools merge -m id -O b -o merged.bcf geno1.bcf ... genoN.bcf

This step's output, merged.bcf, contains all 23 chromosomes.

However, after the next step, filtering for germline CNVs:

delly classify -f germline -o filtered.bcf merged.bcf

the output contains only 3 chromosomes, as shown below:

docker run -it --rm \
-v ~/delly-0.8.7:/tmp/sv \
blcdsdockerregistry/bcftools:1.11 view /tmp/sv/filtered_germline_gCNV.bcf \
> ~/gCNV_debug/filtered_germline_gCNV.csv

grep chr filtered_germline_gCNV.csv | grep -v "#" | awk -F'\t' '{print $1}' | sort | uniq

chr1
chr2
chr3

Any thoughts why this happened? Thanks.

yupan-ucla avatar Jun 17 '21 17:06 yupan-ucla

I suppose the merged.bcf file contains SV calls on all chromosomes? I would guess the delly classify step crashed, do you have the log files?

tobiasrausch avatar Jun 21 '21 14:06 tobiasrausch

Hi Tobias,

delly classify didn't throw any errors or warnings. When I classify 10 samples, I see all the chromosomes, as shown below:

docker run -u $(id -u):$(id -g) --memory 130g -it --rm -v ${out_dir}:/out blcdsdockerregistry/bcftools:1.12 \
view /out/debug_${i}.gcnv.bcf | grep -v "#" | grep PASS | \
cut -f 1 | uniq -c > ${out_dir}/debug_${i}.uniq.chrs.txt

cat debug_10_samples.uniq.chrs.txt
    114 chr1
     84 chr2
     49 chr3
     53 chr4
     39 chr5
     39 chr6
     40 chr7
     23 chr8
     33 chr9
     31 chr10
     35 chr11
     26 chr12
     19 chr13
     15 chr14
     20 chr15
     25 chr16
     30 chr17
      8 chr18
     39 chr19
     13 chr20
     66 chr21
      9 chr22
     51 chrX

When I increased this to 84 samples, only 3 chromosomes were left in the output.

cat debug_84_samples.uniq.chrs.txt
    158 chr1
      9 chr2
      4 chr3

I also tried a different set of samples. Initially, after merging and classifying two samples, chr13 is missing. As I added more samples, the number of missing chromosomes increased, as shown below:

debug_2_samples.uniq.chrs.txt missing chr 13
debug_3_samples.uniq.chrs.txt missing chr 13
debug_4_samples.uniq.chrs.txt missing chr 12 13 14 15 16 18 20 22 X
debug_5_samples.uniq.chrs.txt missing chr 12 13 14 15 16 18 19 20 22 X
debug_6_samples.uniq.chrs.txt missing chr 12 13 14 15 16 18 19 20 22 X
debug_7_samples.uniq.chrs.txt missing chr 12 14 15 16 18 20 22 X
debug_8_samples.uniq.chrs.txt missing chr 12 14 15 16 18 20 22 X
debug_9_samples.uniq.chrs.txt missing chr 12 14 15 16 18 20 22 X
debug_10_samples.uniq.chrs.txt missing chr 8 12 14 15 16 18 20 22 X
debug_20_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_30_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_40_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_50_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_60_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_70_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_80_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_90_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
debug_100_samples.uniq.chrs.txt missing chr 8 9 12 13 14 15 16 18 20 21 22 X
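
A list like the one above can be derived from the per-run count files. The following is only a minimal sketch: the function name missing_chroms is hypothetical, and the expected chromosome set (chr1 to chr22 plus chrX) is an assumption taken from the 10-sample output earlier in this thread.

```shell
# Hypothetical helper: read `uniq -c` style lines ("<count> <chrom>") on stdin
# and print the expected chromosomes that are absent.
# Assumption: the expected set is chr1..chr22 plus chrX.
missing_chroms() {
  awk '
    { seen[$2] = 1 }                       # remember every chromosome that has records
    END {
      out = ""
      for (i = 1; i <= 22; i++)
        if (!(("chr" i) in seen)) out = out " " i
      if (!("chrX" in seen)) out = out " X"
      sub(/^ /, "", out)                   # drop the leading separator
      print out
    }'
}
```

For example, `missing_chroms < debug_84_samples.uniq.chrs.txt` would print the chromosome labels that have no PASS records in that run.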

Maybe this issue is related to specific samples? Any thoughts on how to confirm this? Thanks.

yupan-ucla avatar Jun 21 '21 19:06 yupan-ucla

Hi Tobias,

I'm working on the same CNV dataset. We're using the latest Delly 0.8.7 and bcftools 1.11 (we also tried 1.12, which made no apparent difference). We have done some benchmarking and it appears that delly classify removes CNV variants when at least one sample has no data for GT:CN:CNL:GQ:FT:RDCN:RDSD (i.e. ./.:.:.:.:.:.:.).

This may happen when variants share an identical genomic position but have different CNV IDs.

For example, in the merged BCF after genotyping (bcftools merge), I see entries like

chr13 22354889 CNV00018012
chr13 22354889 CNV00018010

and GT:CN:CNL:GQ:FT:RDCN:RDSD is empty for at least one sample. Is it normal to see entries like this?

As we merged more samples, more empty entries appear to have been generated, and these were filtered out during classification, as Pan showed above.
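
One way to check this on a merged file is to list the records in which at least one sample has a missing genotype. This is a rough sketch rather than anything Delly provides: records_with_missing_gt is a hypothetical name, and it assumes uncompressed VCF text on stdin (for instance from bcftools view) with sample columns starting at column 10.

```shell
# Hypothetical helper: for every record with at least one sample whose
# FORMAT value starts with "./.", print CHROM, POS, ID and the number of
# such samples (tab-separated). Header lines are skipped.
records_with_missing_gt() {
  awk -F'\t' '
    /^#/ { next }
    {
      n = 0
      for (i = 10; i <= NF; i++)           # sample columns start at field 10
        if (substr($i, 1, 3) == "./.") n++
      if (n > 0) print $1 "\t" $2 "\t" $3 "\t" n
    }'
}
```

Piping `bcftools view merged.bcf` into this function and then into `cut -f 1 | uniq -c` would give a per-chromosome count of affected records, which could be compared against the chromosomes that disappear after classification.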

We would appreciate your response.

EDIT: bcftools sort might solve the issue and we're currently testing. We'll keep you posted.

tyamaguchi-ucla avatar Jun 23 '21 01:06 tyamaguchi-ucla

Hi,

We looked into the issue further, and it seems that dropping the -m id option solved it.

The bcftools merge -m id option (i.e. merge by ID, the 3rd column) probably caused the issue: after genotyping, each position had a slightly different ID in each sample (is this expected?), so merging by ID didn't merge the positions properly across samples. This produced empty FORMAT fields, and the lines with no FORMAT data for at least one sample were removed by filtering.

Sample A

chr13   18177772 CNV00017968

Sample B

chr13   18177772 CNV00017970

Merge without -m id

chr13   18177772 CNV00017968;CNV00017970

Merge with -m id -> empty FORMAT (i.e. ./.:.:.:.:.:.:.) for sample A or B

chr13   18177772 CNV00017970
chr13   18177772 CNV00017968

Do you happen to know if there are any issues with dropping the -m id option (i.e. merging by position (and score?))? https://github.com/dellytools/delly/blob/master/README.md#germline-cnv-calling

tyamaguchi-ucla avatar Jun 23 '21 19:06 tyamaguchi-ucla