Augustus icon indicating copy to clipboard operation
Augustus copied to clipboard

exon-exon-junctions scripts missing

Open mictadlo opened this issue 4 years ago • 5 comments

Hi, I follow Incorporating Illumina RNAseq into AUGUSTUS with Tophat. Unfortunately, it requires intron2exex.pl which has been replaced by getIntrons.pl. However, Augustus 3.3.3 does not contain the new script.

Where could I find getIntrons.pl or is another way to do it?

Thank you in advance,

Michal

mictadlo avatar Apr 18 '20 11:04 mictadlo

Dear Michal, please do not use Tophat for this, anymore. Hisat2 will automatically do what we call "iterative mapping". Also 2-pass mapping with STAR will accomplish your goal. Both aligners will be much faster and very accurate. Best, Katharina

KatharinaHoff avatar Apr 20 '20 05:04 KatharinaHoff

Dear Katharina, I have started to use STAR, but how is it possible to include the second pass of STAR and Augustus?

At the moment came so far:

#Index
mkdir star_genome
STAR --genomeChrBinNbits $1 --runThreadN 4 --runMode genomeGenerate --genomeDir star_genome --genomeFastaFiles $2


#Algignment
STAR --runThreadN 8 --outBAMsortingThreadN 8 --genomeDir $2 --readFilesIn $r1 $r2 --readFilesCommand zcat --outSAMattrRGline ID:${output} SM:${output} LB:${output} PL:ILLUMINA --outSAMtype BAM Unsorted SortedByCoordinate --outFileNamePrefix ${3}/${output}

#Unique
filterBam --uniq  --paired --pairwiseAlignments --in $r1 --out ${r1}.uniq.bam


#bam2augustus
bam2hints –intronsonly --in=${r1} --out=${r1}.intron-hints.gff

bam2wig $r1 > ${r1}.wig

cat ${r1}.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep --radius=4.5 --pri=4 --strand="." > ${r1}.ep-hints.gff

cat ${r1}.intron-hints.gff ${r1}.ep-hints.gff > ${r1}.hints.gff 

#Augustus
cat extrinsic.cfg
[SOURCES]
M RM E W

exonpart    1   .992    M 1 1e+100 RM 1 1    E 1 1   W 1 1.005
intron      1   .34     M 1 1e+100 RM 1 1    E 1 1e5 W 1 1
CDSpart     1   1 0.985 M 1 1e+100 RM 1 1    E 1 1   W 1 1
UTRpart     1   1 0.985 M 1 1e+100 RM 1 1    E 1 1   W 1 1
nonexonpart 1   1       M 1 1e+100 RM 1 1.01 E 1 1   W 1 1



augustus ${asm} --genemodel=complete --species=coyote_tobacco --introns=on --UTR=on --extrinsicCfgFile=$3 --alternatives-from-evidence=true --uniqueGeneId=true --hintsfile=${hints} --allow_hinted_splicesites=atac > ${2}/${output}.out

Thank you in advance,

Michal

mictadlo avatar Apr 22 '20 13:04 mictadlo

The key is that after building the index and running the alignments (first round), you re-run STAR before converting the final output of the second STAR run to hints for AUGUSTUS.

For the second run of STAR, you need to append the following argument to the call:

--sjdbFileChrStartEnd first_run_output_directory/SJ.out.tab

(I always write results of the second run to a separate output directory.)

Katharina

KatharinaHoff avatar Apr 23 '20 06:04 KatharinaHoff

Thank you for your reply. After which step should I run hints for AUGUSTUS? Additionally, I found here a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa\  --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

Thank you in advance,

Michal

mictadlo avatar Apr 23 '20 16:04 mictadlo

In our genome, we expect to see 50.000 to 70.000 genes. However, 3 out of 19 choromosomes appears to have too many genes:

> grep  "NbV1Ch01" NbV1Ch01.gff3 | grep -v "#" > NbV1Ch01.clean.gff3
> grep "gene" NbV1Ch01.clean.gff3 | wc -l
17186
> grep  "NbV1Ch02" NbV1Ch02.gff3 | grep -v "#" > NbV1Ch02.clean.gff3
> grep "gene" NbV1Ch02.clean.gff3 | wc -l
11847
> grep  "NbV1Ch03" NbV1Ch03.gff3 | grep -v "#" > NbV1Ch03.clean.gff3
> grep "gene" NbV1Ch03.clean.gff3 | wc -l
13669

Is there a way to reduce the number of genes?

After 2-pass of STAR alignments, I ran the following commands:

#Unique
filterBam --uniq  --paired --pairwiseAlignments --in $r1 --out ${r1}.uniq.bam

#bam2augustus
bam2hints –intronsonly --in=${r1} --out=${r1}.intron-hints.gff

bam2wig $r1 > ${r1}.wig

cat ${r1}.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep --radius=4.5 --pri=4 --strand="." > ${r1}.ep-hints.gff

cat ${r1}.intron-hints.gff ${r1}.ep-hints.gff > ${r1}.hints.gff 

#Augustus
cat extrinsic.cfg
[SOURCES]
M RM E W

exonpart    1   .992    M 1 1e+100 RM 1 1    E 1 1   W 1 1.005
intron      1   .34     M 1 1e+100 RM 1 1    E 1 1e5 W 1 1
CDSpart     1   1 0.985 M 1 1e+100 RM 1 1    E 1 1   W 1 1
UTRpart     1   1 0.985 M 1 1e+100 RM 1 1    E 1 1   W 1 1
nonexonpart 1   1       M 1 1e+100 RM 1 1.01 E 1 1   W 1 1

#augustus
augustus ${asm} --noInFrameStop=true --genemodel=complete --species=coyote_tobacco --gff3=on --UTR=on --extrinsicCfgFile=$3 --alternatives-from-sampling=false --alternatives-from-evidence=false --uniqueGeneId=true --hintsfile=${hints} --allow_hinted_splicesites=atac > ${2}/${output}.gff3

Thank you in advance,

Michal

mictadlo avatar May 07 '20 16:05 mictadlo