pasta
pasta copied to clipboard
"Error in rule pasta" when aligning paralogous gene sequences
Hello! I'm working with a program called ROADIES (link) that uses PASTA. ROADIES basically produces guide trees from provided genome assemblies by selecting a bunch of random "c-genes" (non-recombining sites within the genomes), finding each c-gene in each genome assembly, and aligning the matches to the c-gene with PASTA to produce a gene tree for each c-gene. These gene trees are then merged to create a species tree.
However, I keep encountering an issue - the program exits due to an error with PASTA when the "c-gene" happens to be a sequence from a highly paralogous gene family . Examples: KCNJ potassium transporters and olfactory receptor genes. I cannot for the life of me understand why the program crashes -- I have confirmed that there is adequate memory and CPU allocated to the program, and that this issue doesn't seem to be on the ROADIES end (it's a PASTA error).
Here's an example of the error message encountered when one of the c-genes (in this case, a randomly selected "gene 213" that corresponds to a sequence from EPPK1). EPPK1 is a member of a family of cytoskeletal elements.
Error in rule pasta:
jobid: 710
input: /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa
output: /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln
shell:
if [[ `grep -n '>' /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa | wc -l` -gt 15 ]] || [[ `awk 'BEGIN{l=0;n=0;st=0}{if (substr($0,1,1) == ">") {st=1} else {st=2}; if(st==1) {n+=1} else if(st==2) {l+=length($0)}} END{if (n>0) {print int((l+n-1)/n)} else {print 0} }' /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa` -gt 750 ]]
then
input_file=/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa
output_file=/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln
reference=""
all_matched=true
while IFS= read -r line; do
line=$(echo "$line" | tr '[:lower:]' '[:upper:]')
if [[ "$line" != ">"* ]]; then
if [ -z "$reference" ]; then
reference="$line"
elif [ "$line" != "$reference" ]; then
all_matched=false
break
fi
fi
done < "$input_file"
if [ "$all_matched" = true ]; then
cp "$input_file" "$output_file"
else
run_pasta.py -i /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa -j gene_213 --alignment-suffix=fa.aln --num-cpus 4
fi
fi
touch /lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job pasta since they might be corrupted:
/lab/solexa_page/finn/ROADIES/ROADIES_output_10-24B/genes/gene_213.fa.aln
Has anyone encountered this error before / can anyone explain to me what this section of code is trying to do, and why it might be failing?
Thank you so much for any help!