Augustus icon indicating copy to clipboard operation
Augustus copied to clipboard

Problem with training gene structures

Open menoxx opened this issue 2 years ago • 0 comments

Hi, I want to generate training gene structures from assembled NovaSeq sequence data and PacBio (Sequel II) genome sequences. I followed the procedures in the “augustus.wrp.pdf” and stuck at: :$ gtf2aa.pl genome.fa bonafide.f.gtf prot.aa
The generated port.aa file does not contain separated protein sequences. Instead, it contains one long aminoacid string separated by * as stop. The file prot.aa contains:

MSQPSSPTIHLIREERELEKHLYAHPMPFLVVPRRLKSKSFFIFEQEAELQESFLYLFFMDNQFISFSLGVIFSIAFFTAFKGWRCSSNVFFELELEKQKKKAEYLLNRSWNHYCNTTGREGQLPEGLSFKDMIPLFIEMEDNIKEKILSLNQMNLDLISNGMESDYFINIINSYGLMTSSNTRSGLIPYQKSLSQLLEKFEEVTFFHLSREKKKFADALTTLASMTKIDCGVDLQPLQIEARDSQAYCLNIEEEPDGRPWYHGMLQYIKTWEYPPGATEKENNSEAIHGVLPMQSMISNPAEAIPERGFIP....... and more..

I don’t think this is correct ? is there an easy solution ?

I would like to make a non-redundant set of genes for the next steps.

Here my detailed description of my procedure: Hardware: 2xAMD (64 cores each), 2TB RAM, 21 GB SSD System Ubuntu 20.04.3 LTS

I named the file containing the transcripts cdna.fa (562763 transcripts) and the file for the genome genome.fa (570 contigs)

I followed the description from the pdf “augustus_wrp.pdf” The cdna.fa (>seq) as well as the genome.fa (>contig) headers were simplified by simplifyFastaHeaders.pl

The parameters from the template file were modified: alignAssembly.config

PASA: :$ Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome.fa -t cdna.fa.clean -T -u cdna.fa --ALIGNERS blat,gmap --CPU 96

Problem: it worked only with --CPU 2 Is there a reason why ?

:$ pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta sample_mydb_pasa.assemblies.fasta --pasa_transcripts_gff3 sample_mydb_pasa.pasa_assemblies.gff3

:$ grep complete sample_mydb_pasa.assemblies.fasta.transdecoder.cds | perl -pe ’s/>(\S+)./$1/’ > complete.orfs :$ grep -F -f complete.orfs sample_mydb_pasa.assemblies.fasta.transdecoder.genome.gff3 | grep -P "(\tCDS\t|\texon\t)" | perl -pe ‘s/cds.//; s/.exon\d+//;’ > trainingSetComplete.gff3 :$ mv trainingSetComplete.gff3 trainingSetComplete.temp.gff3 :$ cat trainingSetComplete.temp.gff3 | perl -pe ‘s/\t\S(asmbl_\d+).*/\t$1/’ | sort -n -k 4 | sort -s -k 9 | sort -s -k 1,1 > trainingSetComplete.gff3 :$ ln -s trainingSetComplete.gff3 bonafide.gtf :$ computeFlankingRegion.pl bonafide.gtf :$ gff2gbSmallDNA.pl bonafide.gtf genome.fa 10000 bonafide.gb :$ cat bonafide.gb | perl -ne ‘if(m//gene="(\S+)"/){ \print """.$1.""\n";}’ | sort -u > traingenes.lst :$ grep -f traingenes.lst -F bonafide.gtf > bonafide.f.gtf

Unfortunately, the file bonafide.f.gtf is empty. When I have removed the “ from the traingenes.lst it worked.

:$ ln -s bonafide.gtf bonafide.f.gtf

And finally:

:$ gtf2aa.pl genome.fa bonafide.f.gtf prot.aa

I do not know if the problem is due to PASA or Augustus...

menoxx avatar Dec 28 '21 13:12 menoxx