pasta icon indicating copy to clipboard operation
pasta copied to clipboard

"Error in rule pasta" when aligning paralogous gene sequences

Open finn314 opened this issue 4 months ago • 0 comments

Hello! I'm working with a program called ROADIES (link) that uses PASTA. ROADIES basically produces guide trees from provided genome assemblies by selecting a bunch of random "c-genes" (non-recombining sites within the genomes), finding each c-gene in each genome assembly, and aligning the matches to the c-gene with PASTA to produce a gene tree for each c-gene. These gene trees are then merged to create a species tree.

However, I keep encountering an issue - the program exits due to an error with PASTA when the "c-gene" happens to be a sequence from a highly paralogous gene family . Examples: KCNJ potassium transporters and olfactory receptor genes. I cannot for the life of me understand why the program crashes -- I have confirmed that there is adequate memory and CPU allocated to the program, and that this issue doesn't seem to be on the ROADIES end (it's a PASTA error).

Here's an example of the error message encountered when one of the c-genes (in this case, a randomly selected "gene 213" that corresponds to a sequence from EPPK1). EPPK1 is a member of a family of cytoskeletal elements.

Error in rule pasta:
    jobid: 710
    input: /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa
    output: /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln
    shell:
        
		if [[ `grep -n '>' /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa | wc -l` -gt 15 ]] || [[ `awk 'BEGIN{l=0;n=0;st=0}{if (substr($0,1,1) == ">") {st=1} else {st=2}; if(st==1) {n+=1} else if(st==2) {l+=length($0)}} END{if (n>0) {print int((l+n-1)/n)} else {print 0} }' /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa` -gt 750 ]]
		then
			input_file=/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa
			output_file=/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln
			reference=""
			all_matched=true

			while IFS= read -r line; do
				line=$(echo "$line" | tr '[:lower:]' '[:upper:]')
  				if [[ "$line" != ">"* ]]; then
    				if [ -z "$reference" ]; then
      					reference="$line"
    				elif [ "$line" != "$reference" ]; then
       					all_matched=false
      					break
    				fi
 	 			fi
			done < "$input_file"

			if [ "$all_matched" = true ]; then
				cp "$input_file" "$output_file"
			else
				run_pasta.py -i /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa -j gene_213 --alignment-suffix=fa.aln --num-cpus 4
			fi
		fi
		touch /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln

		
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job pasta since they might be corrupted:
/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln

Has anyone encountered this error before / can anyone explain to me what this section of code is trying to do, and why it might be failing?

Thank you so much for any help!

finn314 avatar Oct 24 '24 20:10 finn314