dysgu icon indicating copy to clipboard operation
dysgu copied to clipboard

Best protocol about dysgu run/merge with WGS 35X from multiple patients? Calling cohorts

Open Taalouane opened this issue 2 years ago • 7 comments

Hi,

I'm working with Human WGS 35X data from multiple patients.

First, I call SV variants with dysgu run (in 26 samples below example for sample 1):

dysgu run \
  --procs 12 \
  --mode pe --pl pe \
  --diploid True \
  --drop-gaps True \
  --max-cov auto \
  --min-support 5 \
  --mq 20 \
  --exclude ${blacklist} \
  --min-size 50 \
  --verbosity 2 \
  ${hg38} \
  ${output}/tmp1 \
  ${input}/sample1.bam > ${output_run}/sample1.SV.vcf

Then, I merge the samples into a unified site list:

dysgu merge \
${output_run}/sample1.SV.vcf  \
${output_run}/sample2.SV.vcf ... \
> ${output_merge}/merged.vcf

I re-genotype at the sample level:

dysgu run --sites ${output_merge}/merged.vcf  ${hg38} tmp1 sample1.bam > ${output_geno}/sample1.re_geno.vcf
dysgu run --sites ${output_merge}/merged.vcf  ${hg38} tmp2 sample1.bam > ${output_geno}/sample2.re_geno.vcf
....

Finally, I merge re-genotyped samples:

dysgu merge \
${output_geno}/sample1.re_geno.vcf  \
${output_geno}/sample2.re_geno.vcf... \
> ${output_merge}/merged.re_geno.vcf

My questions are: 1) Based on WGS with 35X (Illumina paired-end), what better recommended value for "--min-support"? 2) Is the last step (merge regenotyped samples) correct? because is not mentioned in the doc https://github.com/kcleal/dysgu/blob/master/README.rst 3) Do you have any specific recommendations for calling germline variants?

Thanks for any help you can provide me on this protocol,

Best, Tarek

Taalouane avatar May 09 '22 12:05 Taalouane

Hi @Taalouane

I think your approach seem reasonable.

Only one thing that I can suggest - where you re-genotype at the sample level, you should probably explicitly remove the target sample from the merged.vcf for each re-genotype run. I realise the documentation is not clear on this point currently. bcftools can be used for this, e.g. after generating merged.vcf:

# generate a list of samples in the vcf
bcftools query -l merged.vcf > sample_list.txt

# perform re-genotyping
i=1
while read p; do
    # here we remove the current sample from the merged vcf, and drop records with no genotype
    bcftools view -Ov -s $(sed "${i}d" sample_list.txt | sed 'N;s/\n/,/') merged.vcf | \
            bcftools view -Ov -i'GT!="0"' - > tmp.vcf
   # bam file name needs filling out here
    dysgu call -x --sites tmp.vcf $HG38 wd_deleteme $p.bam > $p.regeno.bam 
    let "i=i+1"
done < sample_list.txt

dysgu does not do this internally as things can get complicated if merging across platforms of when feeding in vcf's generated by different callers.

To answer your other questions:

  1. The --min-support value can be left at the default value or 3 for paired-end data, it is usually more accurate to filter on the PROB field post calling, than adjusting --min-support.
  2. Merging the regenotyped samples is optional, it just depends if you would like to know where the shared sites are in your cohort
  3. The default options are currently set up for calling germline samples, so I can recommend sticking with the defaults.

Kez

kcleal avatar May 10 '22 13:05 kcleal

Thank you so much Kez for your detailed answer !! I'll run the pipeline again with the proposed changes and get back to you (I leave the issue open).

Taalouane avatar May 10 '22 15:05 Taalouane

Hi @kcleal, I am a bit confused regarding your proposed workflow. If I understand it correctly, you delete the supporting information of the sample to genotype to avoid kind of having artificial "double support" for a variant? However, from a practical perspective, this means that we would also remove the ID and positional information for the deleted Variants and that this reduces accuracy in merging the regenotyped samples. Wouldn't it be better to only set all quality metrics tags in the INFO and sample fields to missing and by this keep the positional information of the variant? Right now, the regenotyped vcfs contain different SVs, which causes problems when merging...

Best, Johannes

johannesgeibel avatar Aug 15 '22 09:08 johannesgeibel

Hi @johannesgeibel, I think I understand what you mean. The original workflow aims to try and avoid double counting the original variant, as you say. The input sites shouldn't influence the positions of newly called variants, but only influence the resulting probabilities of the newly called events. Internally, input sites get associated with candidate SVs and only after probabilities are assigned, then the associated sites are used to modify those values. Additionally, candidate SVs with an associated site are less likely to get dropped during calling, so you may end up with more SVs by adding in some sites. Perhaps this explains the different calls you see? The bit of code that might cause this is line 1210 in cluster.pyx, where candidate events are kept if ::

event.su >= args["min_support"] or event.site_info

So candidate events that are associated with a site can be kept even if they have low support. What are the problems with merging that you refer to?

kcleal avatar Aug 15 '22 10:08 kcleal

Hi @kcleal, The issue with merging means, that I commonly use bcftools based on the variant ID for merging after forced genotyping to ensure that we merge the variants we genotyped. If I have now different callsets after forced genotyping, merging (also using other tools) should result in ungenotyped sites again and might also modify previously derived start and end points of variants. But if I understood you (and the documentation) correctly, when setting --parse-probs True , --all-sites True, dropping the sample we genotype from the vcf and setting INFO:MeanPROB to 0.5 for the variants, we would have dropped otherwise, this would create the force-call behaviour without double support (+ some additional calls, which we could drop)?

johannesgeibel avatar Aug 15 '22 11:08 johannesgeibel

I think this sounds reasonable. Additionally, I apologize, but I found a bug when I was reviewing the code - I realized the option --parse-probs was always being treated as True. I have fixed this in v1.3.13, so now --parse-probs works as expected. I hope this will not cause you too many problems - the differences between True/False are generally quite small anyway.

kcleal avatar Aug 16 '22 09:08 kcleal

Hi @kcleal , thanks for the answer and sorry for bothering you again. I played around with the merging and came across another issue. I can work around it, but it might be worth considering to add a fix in the future. When force-calling and requesting to output all variants in the sites file (--all-sites True), the 0/0 genotypes are returned without setting ID and REF and most dsygu specific tags are ".". E.g.:

chr1    53      .       .       <DUP>   .       .       SVMETHOD=DYSGUv1.3.12;SVTYPE=DUP;END=1285;CHR2=chr1;GRP=.;NGRP=.;CT=.;CIPOS95=.;CIEND95=.;SVLEN=1231;CONTIGA=.;CONTIGB=.;KIND=.;REP=0;REPSC=0;GC=.;NEXP=.;STRIDE=.;EXPSEQ=.;RPOLY=.;OL=.;SU=0;WR=0;PE=0;SR=0;SC=0;BND=0;LPREC=.;RT=.    GT:GQ:NMP:NMS:NMB:MAPQP:MAPQS:NP:MAS:SU:WR:PE:SR:SC:BND:SQC:SCW:SQR:BE:COV:MCOV:LNK:NEIGH:NEIGH10:RB:PS:MS:SBT:NG:NSA:NXA:NMU:NDC:RMS:RED:BCC:FCC:STL:RAS:FAS:ICN:OCN:CMP:RR:JIT:PROB    0/0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0.123

This means that merging based on the ID won't work. I therefore merged using dysgu merge, but there seems to be a bug, when Tags are .:

Traceback (most recent call last):
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/dysgu/view.py", line 330, in vcf_to_df
    df[k] = df[k].astype(dtype)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/pandas/core/generic.py", line 5815, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 418, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 327, in apply
    applied = getattr(b, f)(**kwargs)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 591, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1309, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1257, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1174, in astype_nansafe
    return lib.astype_intsafe(arr, dtype)
  File "pandas/_libs/lib.pyx", line 679, in pandas._libs.lib.astype_intsafe
ValueError: invalid literal for int() with base 10: '.'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/users/geibel/miniconda3/envs/dysgu/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/dysgu/main.py", line 486, in view_data
    view.view_file(ctx.obj)
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/dysgu/view.py", line 568, in view_file
    df, header, _ = vcf_to_df(item)  # header here, assume all input has same number of fields
  File "/scratch/users/geibel/miniconda3/envs/dysgu/lib/python3.7/site-packages/dysgu/view.py", line 333, in vcf_to_df
    raise ValueError("Problem for feature {}, could not intepret as {}".format(k, dtype))
ValueError: Problem for feature grp_id, could not intepret as <class 'numpy.int64'>

Guess this could be easily fixed by interpreting . as missing when reading the vcfs? However, for me this means that I cannot use --all-sites True and thus the GT tags for 0/0 samples will be only 0 after merging. This is easy to post-fix as no-call sites are still ./., but I loose the SAMPLE/PROB field, which might contain some interesting information...

johannesgeibel avatar Aug 16 '22 14:08 johannesgeibel

I finally got around to addressing the original issue of ignoring the input sample from a --sites file. This now happens by default in v1.6.1 and is controlled by the --ignore-sample-sites option. Regarding the missing genotype issue ., hopefully this has also now been fixed. Marking this as complete, but please open if issues persist.

kcleal avatar Sep 22 '23 14:09 kcleal