Error: "An output file is marked as pipe, but consuming jobs are part of conflicting groups."
Hi @aryarm ,
I am running your pipeline starting with called peaks and a .BAM file. However, when doing a dry-run, I get the following error:
"Building DAG of jobs... WorkflowError in line 350 of /home/######/projects/varCA/rules/prepare.smk: An output file is marked as pipe, but consuming jobs are part of conflicting groups."
My sample.tsv file looks like:
EMC378  /absPathTo/EMC378.filtered.bam  /absPathTo/EMC378_peaks.bed
EMC385  /absPathTo/EMC385.filtered.bam  /absPathTo/EMC385_peaks.bed
....
In my config.yaml I define:
sample_file: /absPathTo/BSF_1013_2000HIV_hg38_perIndiv_atac_varCA_samples.tsv
genome: /absPathTo/genome.fa
snp_callers: [gatk-snp, varscan-snp, vardict-snp]
indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strelka]
snp_filter: ['gatk-snp~DP>5']
indel_filter: ['gatk-indel~DP>5']
snp_model: data/snp.rda
indel_model: data/indel.rda
I cloned your repository on Jul 26 13:11, and I am running Snakemake 6.6.0.
I run the following command to do a dry-run:
snakemake --config out="$out_path" -p -n
I might have misunderstood something in the way different folders need to be defined, so apologies in advance if that is the case.
Thanks!
PS. Could it be that the example data is no longer available?
I created a new conda environment with the recommended Snakemake version, and now the error is no longer there. It seems that Snakemake 6.6.0 is not backwards compatible with workflows written for Snakemake 5.18.0?
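For reference, I set it up roughly like this (assuming 5.18.0 is the version recommended in the README; your channel setup may differ):

conda create -n varca -c conda-forge -c bioconda snakemake=5.18.0
conda activate varca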
Thanks!
Hi @Rubbert,
Thanks for reporting this issue and for all of the helpful information. Yes, I think there might be a few problems with the pipeline that prevent it from being used with more recent versions of Snakemake, but support for v6.0 is definitely on the list!
I'd also like to improve the test dataset; I want it to be smaller and faster to test VarCA with.
And yet another issue I've been having is figuring out the best way to distribute the test dataset. Currently, I distribute it as an asset with every release, but this requires that I re-upload the file after every release of VarCA, and I sometimes forget to upload it, as you noticed with release v0.3.1 (which I have since fixed - thanks for letting me know!). The solution I'm currently leaning towards is hosting it on Google Drive here, since our institution has free storage there, but if I do that, I don't think users will be able to download it with wget anymore. Let me know if you have any better suggestions!
Anyway, here is where all of those things fall on my current list of priorities, in case you're interested. I'm tracking support for Snakemake v6.0 in #19.
Hi @aryarm ,
I was able to run part of the pipeline on our cluster. However, it looks like Snakemake 5.18.0 does not play well with the slurm cluster profiles (the --profile option) for all jobs. It basically will not run "normalize_vcf" and throws different errors for two different profile definitions/YAMLs that worked with Snakemake 6.x.x.
Do you happen to know what causes the error I mentioned above about the "conflicting groups", and is there an easy hack I can implement to see if it will run with Snakemake 6?
Update: I tested the pipeline with different Snakemake versions. I still can't get it to run on our cluster, but Snakemake 5.27.4 shows the "An output file is marked as pipe, but consuming jobs are part of conflicting groups." error, while version 5.26.1 is still OK. So it does not appear to be a Snakemake 6.x.x problem, but rather a change introduced between 5.26.1 and 5.27.4.
Thanks!
Cheers,
Rob
Hi @Rubbert ,
Apologies for all of the trouble that this has been causing you!
I'm not entirely sure why Snakemake has different behavior with pipe()ed jobs among those versions. But I've been working on updating VarCA to Snakemake v5.24.2 (as a starting point before I try Snakemake v6.x), and one thing I've tried is rewriting the pipe()s into temp()s. You can see one example of this on the development branch for the upcoming release.
You might want to try that, as well. I think it substantially simplifies the DAG resolution because it allows Snakemake to group fewer jobs together. The downside is that it will probably make execution of the pipeline a bit slower because it increases file IO.
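To illustrate what I mean, here is a minimal sketch of the change. The rule and tool names below are hypothetical placeholders, not VarCA's actual rules:

# Before, the caller streamed its VCF straight into the consuming rule:
#     output: pipe("calls/{sample}.vcf")
# After, it writes a temp() file instead; Snakemake deletes the file
# automatically once every consumer has finished with it.
rule call_variants:
    input:
        "aligned/{sample}.bam"
    output:
        temp("calls/{sample}.vcf")
    shell:
        "variant_caller {input} > {output}"

rule normalize_vcf:
    input:
        "calls/{sample}.vcf"
    output:
        "normalized/{sample}.vcf"
    shell:
        "normalize_tool {input} > {output}"

Because the two rules no longer share a pipe, Snakemake doesn't have to schedule them in the same job group, which seems to be what trips the "conflicting groups" check.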
I'm hoping to get the upcoming release out soon. It's just proving to be a bit large - there will be a lot of updates!
Hi @Rubbert,
I haven't forgotten about this! I'm just making a note here for myself later:
https://github.com/snakemake/snakemake/issues/975 seems to indicate that piped output no longer works for some versions of 6.x 😞 If that's the case, then I'll probably just convert all the pipe()s to temp()s in the next release.
The original problem I posted is still present in version 6.8.
When you say convert pipe to temp do you mean create concrete files instead of streaming between steps? That won't work for us since using pipe is an important optimization due to the size of our data files.
Oh, boy. I'm sorry to hear that :(
When you say convert pipe to temp do you mean create concrete files instead of streaming between steps?
Yes, that was what I was proposing. If it doesn't work for you, then I'm fresh out of ideas.
What sort of issues did you have when you tried converting the pipe()s to temp()s? Did you run out of disk space?
Because if that's the case, you could try gzip-compressing the temp files. Or you could try running fewer of the samples at once by setting a max number of jobs via the -j parameter.
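For example, something along these lines (placeholder rule and tool names again, not the actual VarCA rules):

rule call_variants:
    input:
        "aligned/{sample}.bam"
    output:
        temp("calls/{sample}.vcf.gz")    # compressed temp file to cut down on file IO
    shell:
        "variant_caller {input} | gzip -c > {output}"

# downstream rules would then read the compressed file, e.g.
#     "zcat {input} | normalize_tool - > {output}"

And to cap how many samples run at once:

snakemake --config out="$out_path" -j 2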
@Rubbert, once you have a chance to try the solution I recommended, can you let me know here? It will save me some time when I go to update VarCA to the newest version of Snakemake.
I'm only working in development mode now, so the data files are small. I haven't actually tried it with our actual data files, which can be hundreds of GB in size. I did some performance testing last year and found that there was a 3x performance improvement when using piped output instead of writing intermediate files. It's not really a question of disk space. And it's not a question of the number of samples being run either, since the files for even one sample can be enormous. Zipping the temp files wouldn't help either, since compressing such large files is a colossal time suck in and of itself.
hmm... yeah, I can imagine that we'll see similar performance differences if we convert the pipe()s to temp()s.
It would be really nice if someone could resolve the Snakemake issue.
Zipping the temp files wouldn't help either since compressing such large files is a colossal time suck in and of itself.
You might want to briefly explore zipping, anyway? Even small amounts of compression can significantly reduce file IO and my understanding is that gzipping can be relatively fast.
Based on experience I doubt that would help. There are many steps in the pipeline and hundreds of GB of data would be redundantly read, written (and now compressed) to and from disk. Compression is not going to solve this problem.