Errors when run with large number of samples

Open JoannaTan opened this issue 5 years ago • 7 comments

Hi @brentp,

I am running smoove v0.2.2 with singularity as part of a nextflow process. With small sample sizes (i.e. 30, 100, and 300), all the smoove commands work perfectly and the whole process completes. However, when I increase the sample size to 654 samples, I keep getting errors that are not reproducible.

One of the errors that I get is:

  [smoove] 2018/12/20 04:18:13 starting with version 0.2.2
  [smoove] 2018/12/20 04:18:13 squaring 654 files to 654sample.smoove.square.vcf.gz
  [smoove] 2018/12/20 04:19:38 files: WHH443_joint-smoove.genotyped.vcf.gz had 65509 variants
  [smoove] 2018/12/20 04:19:38 653 files had 65507 variants
  [smoove] 2018/12/20 04:19:38 please make sure that all files have the same number of variants

But when I run the smoove genotype command for that file in a singularity shell, the output has 65507 variants. I checked the original file with 65509 variants and noticed that 2 of the variants were repeated in the file.
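The check described above (counting records and looking for repeated variants) can be sketched in shell. This is only a rough sketch, not part of smoove: the function names `count_records` and `duplicate_sites` are made up for illustration, and it assumes that a repeated variant shows up as a repeated CHROM/POS/ID/REF/ALT combination.

```shell
# Hypothetical helpers, not part of smoove.

# Count the non-header records in a VCF (plain or bgzipped).
# gzip -cdf decompresses gzip data and passes plain text through unchanged.
count_records() {
    gzip -cdf "$1" | grep -vc '^#'
}

# Print every site (CHROM, POS, ID, REF, ALT) that occurs more than once.
# Assumption: a repeated variant repeats these first five columns exactly.
duplicate_sites() {
    gzip -cdf "$1" | grep -v '^#' | cut -f1-5 | sort | uniq -d
}

# Example usage (file name taken from the log above):
#   count_records WHH443_joint-smoove.genotyped.vcf.gz
#   duplicate_sites WHH443_joint-smoove.genotyped.vcf.gz
```

Running `count_records` over all per-sample files and comparing the numbers should point at the file that smoove paste complains about.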

Another error which I get is:

  [smoove] 2018/12/19 03:52:57 starting with version 0.2.2
  [smoove] 2018/12/19 03:52:57 merging 654 files
  [smoove] 2018/12/19 03:52:57 finished sorting 654 files; merge starting.
  [smoove] 2018/12/19 03:58:11 Required tag PREND not found. Please ensure you've run lumpy with the -P option to emit breakpoint probabilities.
  2018/12/19 03:58:11 exit status 1

I noticed that the process always breaks at either the smoove merge or the smoove paste step.

Thank you.

Best regards, Joanna

JoannaTan, Dec 31 '18 01:12

Hi, this was reported elsewhere, but I haven't been able to reproduce it. Could you share the full set of commands that you are using? Using only a single thread for the call and genotype steps should rule out the only way I can see this happening. I'll have a careful look at the code in January to see how this can possibly happen. In the meantime, you can also grep for any file that does not have PREND in every line.
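One way to do the suggested grep is sketched below. It is only a sketch: the function name `files_missing_prend` is hypothetical, and it assumes the tag appears as `PREND=` in the INFO column of every record that has it.

```shell
# Hypothetical helper, not part of smoove.
# Print the name of every VCF (plain or bgzipped) that contains at least
# one non-header record without a PREND= entry.
files_missing_prend() {
    for f in "$@"; do
        # grep -qv succeeds as soon as one record lacks "PREND="
        if gzip -cdf "$f" | grep -v '^#' | grep -qv 'PREND='; then
            echo "$f"
        fi
    done
}

# Example usage on the per-sample files passed to smoove merge
# (the glob is an assumption about how the files are named):
#   files_missing_prend ./*.vcf.gz
```

Header lines are skipped first, so the `##INFO=<ID=PREND,...>` definition does not mask a record that lacks the tag.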

brentp, Dec 31 '18 17:12

Hi @brentp

Thank you for your help.

Please find below the commands that I used.


  #smoove call variants per sample
  smoove call --name ${sampleid} --fasta ${genome} --genotype ${bam} -p 3 --outdir ./

  #smoove merge
  smoove merge --name ${prefix} -f ${genome} ${vcf} --outdir ./

  #smoove genotype for each sample at the sites
  smoove genotype -x -p 3 --name ${samid}_joint --fasta ${genomefasta} --vcf ${merged_vcf} ${bamfile} --outdir ./

  #combine all the genotype
  smoove paste --name ${prefix} ${jointvcf}

  #Annotation
  smoove annotate --gff ${gff} ${squarevcf} | bgzip -c > ${prefix}.smoove.square.anno.vcf.gz

Thank you.

JoannaTan, Jan 02 '19 00:01

Hi, I am not able to see how this could happen, and I am still looking for ways to reproduce it. One way it might happen is if one or more samples are represented twice in the same cohort, so that the files overwrite each other. In your example, that would be the case if samid or sampleid were not unique.
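A quick way to rule out repeated sample IDs is sketched below. It assumes the IDs are available one per line in a text file; the function name `duplicate_ids` and the file name `samples.txt` are hypothetical.

```shell
# Hypothetical helper, not part of smoove.
# Print every sample ID that appears more than once in a file
# containing one ID per line.
duplicate_ids() {
    sort "$1" | uniq -d
}

# Example usage (samples.txt is an assumed file name):
#   duplicate_ids samples.txt
# No output means every ID is unique.
```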

You could also try running everything with a single thread, though I think that's unlikely to resolve the issue.

If the wrong number of lines consistently occurs for one sample, it would be great if you could share the data so I can have a look.

brentp, Jan 07 '19 18:01

Hi @brentp,

I tried running everything with a single thread, and it works. =)

I checked that all the samid values are unique, so no files are being overwritten.

JoannaTan, Jan 08 '19 01:01

@JoannaTan thanks for verifying. I still don't see how this could happen, but at least we have a work-around. We've run this on > 50K samples and not seen this problem, but it might be that we always use a single thread. I'll keep digging.

brentp, Jan 08 '19 02:01

I just ran into this issue and am trying to re-run on a single thread. I'm hopeful given the above thread!

Thatguy027, Mar 19 '19 07:03

Just following up - setting -p 1 for smoove genotype resolved the discrepancy in the number of sites per sample.
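For reference, the single-thread work-around applied to the call and genotype commands posted earlier in the thread might look like the sketch below. The flags are the ones from those commands with only `-p` changed; the wrapper functions (`run`, `call_sample`, `genotype_sample`) and the `DRY_RUN` switch are hypothetical conveniences, not part of smoove.

```shell
THREADS=1   # single thread: the work-around reported in this thread

# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Per-sample calling. Arguments: sample id, reference fasta, BAM.
call_sample() {
    run smoove call --name "$1" --fasta "$2" --genotype "$3" -p "$THREADS" --outdir ./
}

# Per-sample genotyping at the merged sites.
# Arguments: sample id, reference fasta, merged VCF, BAM.
genotype_sample() {
    run smoove genotype -x -p "$THREADS" --name "${1}_joint" --fasta "$2" --vcf "$3" "$4" --outdir ./
}

# Example usage (all file names are placeholders):
#   DRY_RUN=1 call_sample S1 ref.fa S1.bam
```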

Not sure that I'd be able to reproduce the problem, but can send along the intermediate VCFs that had different numbers of sites if it will be helpful.

BTW, following up on the merging thread from yesterday - I checked the previous version of smoove I was running, where I had no issues with thread numbers: v0.1.9. In case that helps your digging around.

Thatguy027, Mar 20 '19 04:03