sunbeam icon indicating copy to clipboard operation
sunbeam copied to clipboard

Number of sequences not the same in paired files after decontamination

Open Anqi-Dai opened this issue 3 years ago • 1 comments

Hi sunbeam team, Thank you for developing this pipeline, which looks very exciting. I have a question, I created a test_R1.fastq.gz and a test_R2.fastq.gz using the first 20000 rows in one of my sample in the R1 and R2 files. And after decontamination , there is 19636 left in R1 and 19620 left in R2, so the unequal number is giving megahit the errors: "Number of sequences not the same in paired files". I feel the decontamination should be able to give equal number paired end reads, but I can't find what tool was used to do the decontamination. How to best fix this issue in the current sunbeam pipeline?

Also, I have already qc-ed and decontaminated my samples, is it possible to start only the assembly right away?

Thank you Angel

Anqi-Dai avatar Jun 01 '21 21:06 Anqi-Dai

I have seen the same issue but in my case the step that kept the unpaired reads was Komplexity. You can check if that is the case by counting the number of reads per sample at each step. What worked for me was to grab the sequences at that step and repair them with another program . I used BBmap repair.sh with uncompressed fastq and piped the output to a folder called out, then I deleted single reads, replaced the old paired files with the new ones, removed decontam files and clean files and resume the pipeline

for i in ls -1 *1.fastq | sed 's/_1.fastq//'; do repair.sh in1=$i\_1.fastq in2=$i\_2.fastq out1=out/$i\_1.fastq out2=out/$i\_2.fastq outs=$i\_single.fastq repair; done

carden24 avatar Oct 21 '21 17:10 carden24

Thanks for that answer @carden24.

@Anqi-Dai in order to only run a later part of the pipeline, there's a sort of snakemake hack you can use with the --allowed-rules flag. So in your case where you want to skip qc and just run assembly, premake your sunbeam-output directory and then in that make the qc dir with the decontam dir inside that. So you should have a directory structure of sunbeam-output/qc/decontam. Then move your pre-qc-ed files into decontam.

This will mimic the output needed to start the assembly rules. Now to get snakemake to ignore the missing outputs for all the earlier qc steps, run sunbeam run --configfile sunbeam_config.yml all_assembly --allowed-rules megahit_paired final_filter clean_assembly all_assembly.

--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...] Only consider given rules. If omitted, all rules in Snakefile are used. Note that this is intended primarily for internal use and may lead to unexpected results otherwise. (default: None)

Note the help text on this flag, this is sort of intended for use internal to snakemake, so no guarantees it'll work as you expect all the time.

If you have HPC to spare, you can also always run sunbeam run --configfile sunbeam_config.yml all_qc, then replace the contents of sunbeam_output/qc/decontam with your files and run sunbeam run --configfile sunbeam_config.yml all_assembly.

Ulthran avatar Sep 29 '22 21:09 Ulthran