Augustus
Augustus copied to clipboard
exon-exon-junctions scripts missing
Hi,
I follow Incorporating Illumina RNAseq into AUGUSTUS with Tophat. Unfortunately, it requires intron2exex.pl
which has been replaced by getIntrons.pl
. However, Augustus 3.3.3 does not contain the new script.
Where could I find getIntrons.pl
or is another way to do it?
Thank you in advance,
Michal
Dear Michal, please do not use Tophat for this, anymore. Hisat2 will automatically do what we call "iterative mapping". Also 2-pass mapping with STAR will accomplish your goal. Both aligners will be much faster and very accurate. Best, Katharina
Dear Katharina, I have started to use STAR, but how is it possible to include the second pass of STAR and Augustus?
At the moment came so far:
#Index
mkdir star_genome
STAR --genomeChrBinNbits $1 --runThreadN 4 --runMode genomeGenerate --genomeDir star_genome --genomeFastaFiles $2
#Algignment
STAR --runThreadN 8 --outBAMsortingThreadN 8 --genomeDir $2 --readFilesIn $r1 $r2 --readFilesCommand zcat --outSAMattrRGline ID:${output} SM:${output} LB:${output} PL:ILLUMINA --outSAMtype BAM Unsorted SortedByCoordinate --outFileNamePrefix ${3}/${output}
#Unique
filterBam --uniq --paired --pairwiseAlignments --in $r1 --out ${r1}.uniq.bam
#bam2augustus
bam2hints –intronsonly --in=${r1} --out=${r1}.intron-hints.gff
bam2wig $r1 > ${r1}.wig
cat ${r1}.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep --radius=4.5 --pri=4 --strand="." > ${r1}.ep-hints.gff
cat ${r1}.intron-hints.gff ${r1}.ep-hints.gff > ${r1}.hints.gff
#Augustus
cat extrinsic.cfg
[SOURCES]
M RM E W
exonpart 1 .992 M 1 1e+100 RM 1 1 E 1 1 W 1 1.005
intron 1 .34 M 1 1e+100 RM 1 1 E 1 1e5 W 1 1
CDSpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1
UTRpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1
nonexonpart 1 1 M 1 1e+100 RM 1 1.01 E 1 1 W 1 1
augustus ${asm} --genemodel=complete --species=coyote_tobacco --introns=on --UTR=on --extrinsicCfgFile=$3 --alternatives-from-evidence=true --uniqueGeneId=true --hintsfile=${hints} --allow_hinted_splicesites=atac > ${2}/${output}.out
Thank you in advance,
Michal
The key is that after building the index and running the alignments (first round), you re-run STAR before converting the final output of the second STAR run to hints for AUGUSTUS.
For the second run of STAR, you need to append the following argument to the call:
--sjdbFileChrStartEnd first_run_output_directory/SJ.out.tab
(I always write results of the second run to a separate output directory.)
Katharina
Thank you for your reply. After which step should I run hints for AUGUSTUS? Additionally, I found here a walkthrough of the STAR 2-pass alignment steps:
1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:
genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa\ --runThreadN <n>
2) Alignment jobs were executed as follows:
runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>
3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:
genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
--sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>
4) The resulting index is then used to produce the final alignments as follows:
runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>
Thank you in advance,
Michal
In our genome, we expect to see 50.000 to 70.000 genes. However, 3 out of 19 choromosomes appears to have too many genes:
> grep "NbV1Ch01" NbV1Ch01.gff3 | grep -v "#" > NbV1Ch01.clean.gff3
> grep "gene" NbV1Ch01.clean.gff3 | wc -l
17186
> grep "NbV1Ch02" NbV1Ch02.gff3 | grep -v "#" > NbV1Ch02.clean.gff3
> grep "gene" NbV1Ch02.clean.gff3 | wc -l
11847
> grep "NbV1Ch03" NbV1Ch03.gff3 | grep -v "#" > NbV1Ch03.clean.gff3
> grep "gene" NbV1Ch03.clean.gff3 | wc -l
13669
Is there a way to reduce the number of genes?
After 2-pass of STAR alignments, I ran the following commands:
#Unique
filterBam --uniq --paired --pairwiseAlignments --in $r1 --out ${r1}.uniq.bam
#bam2augustus
bam2hints –intronsonly --in=${r1} --out=${r1}.intron-hints.gff
bam2wig $r1 > ${r1}.wig
cat ${r1}.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep --radius=4.5 --pri=4 --strand="." > ${r1}.ep-hints.gff
cat ${r1}.intron-hints.gff ${r1}.ep-hints.gff > ${r1}.hints.gff
#Augustus
cat extrinsic.cfg
[SOURCES]
M RM E W
exonpart 1 .992 M 1 1e+100 RM 1 1 E 1 1 W 1 1.005
intron 1 .34 M 1 1e+100 RM 1 1 E 1 1e5 W 1 1
CDSpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1
UTRpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1
nonexonpart 1 1 M 1 1e+100 RM 1 1.01 E 1 1 W 1 1
#augustus
augustus ${asm} --noInFrameStop=true --genemodel=complete --species=coyote_tobacco --gff3=on --UTR=on --extrinsicCfgFile=$3 --alternatives-from-sampling=false --alternatives-from-evidence=false --uniqueGeneId=true --hintsfile=${hints} --allow_hinted_splicesites=atac > ${2}/${output}.gff3
Thank you in advance,
Michal