
Creating a Hints file for use in Augustus from a gff file

Open sanyalab opened this issue 2 years ago • 6 comments

Hi Jacques,

The hints file in Augustus (https://github.com/Gaius-Augustus/Augustus/blob/master/docs/RUNNING-AUGUSTUS.md#using-hints) is a useful aid for predicting genes. However, scripts exist only to convert BAM (http://manpages.ubuntu.com/manpages/bionic/man1/bam2hints.1.html) and GTF files to a hints file (https://github.com/Gaius-Augustus/Augustus/tree/master/scripts). There is no script that processes a GFF file into a single hints file with all 16 hint types for direct use.

Can this be taken up as an enhancement objective for AGAT? I am typically thinking of a GFF file with features (exon, intron, CDS, five_prime_utr, three_prime_utr) that can provide info for the hint types (translation start, translation stop, acceptor splice site, donor splice site, exact exon, part of exon, exact intron in CDS/UTR, part of an intron, CDS, CDSpart, UTR, UTRpart).

Thanks Abhijit

sanyalab avatar Sep 19 '21 14:09 sanyalab

Sounds like a good task for AGAT. We will think about it.

Juke34 avatar Sep 20 '21 09:09 Juke34

Thanks Jacques. The problem with the GTF2hints scripts is that there are several of them, and one has to tie them together using another join script in the Augustus scripts directory. A one-stop-shop gff2hints script (or something like that) would be more suitable.

Thanks for looking into this. Abhijit

sanyalab avatar Sep 21 '21 02:09 sanyalab

Hi Jacques,

I was wondering if this request can still be worked on.

Regards Abhijit

sanyalab avatar Feb 21 '23 08:02 sanyalab

I would like to develop this but do not have time. It would help if you could start by listing all the hint types used by Augustus, what they represent, and how they can be generated (e.g. an intron corresponds to the region in a gene between two exons).

Juke34 avatar Feb 27 '23 13:02 Juke34

Hi there, I'd also be interested in this. I know very little about Perl, so while I'd typically be interested in helping out by contributing, I don't think I'd do it justice. Per your last question, @Juke34, here are all 16 hint types that @sanyalab mentioned, from the Augustus documentation linked above. Direct deep link below:

https://github.com/Gaius-Augustus/Augustus/blob/master/docs/RUNNING-AUGUSTUS.md#using-hints:~:text=Setting%20the%20bonus%20to%201.0%20disables%20the%20boni.

  • [x] 1. start: translation start (start codon), specifies an interval that contains the start codon. The interval can be larger than 3bp, in which case every ATG in the interval gets a bonus. The highest bonus is given to ATGs in the middle of the interval, the bonus fades off towards the ends.
  • [ ] 2. stop: translation end (stop codon), see 'start'
  • [ ] 3. tss: transcription start site, see 'start'
  • [x] 4. tts: transcription termination site, see 'start'
  • [x] 5. ass: acceptor (3') splice site, the last intron position, for only approximately known ass an interval can be specified
  • [x] 6. dss: donor (5') splice site, the first intron position, for only approximately known dss an interval can be specified
  • [ ] 7. exonpart: part of an exon in the biological sense. The bonus applies only to exons that contain the interval from the hint. Just overlapping means no bonus at all. The malus applies to every base of an exon. Therefore the malus for an exon is exponential in the length of an exon: malus=exonpartmalus^length. Therefore the malus should be close to 1, e.g. 0.99.
  • [x] 8. exon: exon in the biological sense. Only exons that exactly match the hint get a bonus. Exception: The exons that contain the start codon and stop codon. This malus applies to a complete exon independent of its length.
  • [ ] 9. intronpart: introns both between coding and non-coding exons. The bonus applies to every intronic base in the interval of the hint.
  • [x] 10. intron: An intron gets the bonus if and only if it is exactly as in the hint.
  • [ ] 11. CDSpart: part of the coding part of an exon. (CDS = coding sequence)
  • [x] 12. CDS: coding part of an exon with exact boundaries. For internal exons of a multi exon gene this is identical to the biological boundaries of the exon. For the first and the last coding exon the boundaries are the boundaries of the coding sequence (start, stop).
  • [ ] 13. UTR: exact boundaries of a UTR exon or the untranslated part of a partially coding exon.
  • [ ] 14. UTRpart: The hint interval must be included in the UTR part of an exon.
  • [ ] 15. irpart: The bonus applies to every base of the intergenic region. If UTR prediction is turned on (--UTR=on) then UTR is considered genic. If you choose against the usual meaning the bonus of irparts to be much smaller than 1 in the configuration file you can force AUGUSTUS to not predict an intergenic region in the specified interval. This is useful if you want to tell AUGUSTUS that two distant exons belong to the same gene, when AUGUSTUS tends to split that gene into smaller genes.
  • [ ] 16. nonexonpart: intergenic region or intron. The bonus applies to every non-exon base that overlaps with the interval from the hint. It is geometric in the length of that overlap, so choose it close to 1.0. This is useful as a weak kind of masking, e.g. when it is unlikely that a retroposed gene contains a coding region but you do not want to completely forbid exons.
  • genicpart (alias: nonirpart): everything that is not intergenic region, i.e. intron or exon or UTR if applicable. The bonus applies to every genic base that overlaps with the interval from the hint. This can be used in particular to make Augustus predict one gene between positions a and b if a and b are experimentally confirmed to be part of the same gene, e.g. through ESTs from the same clone.
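
As an illustration of how some of these hints could be derived from a GFF annotation, here is a minimal Python sketch that turns one intron interval into `dss` and `ass` hint lines, following the definitions in items 5 and 6 (first and last intron position in transcript direction). The function name, the `gff2hints_sketch` source column, and the default `src=M` tag are assumptions made for this example; the `src` value has to match a source defined in the Augustus extrinsic configuration file you use. This is not how AGAT or the Augustus scripts do it.

```python
def splice_site_hints(seqid, intron_start, intron_end, strand, src="M"):
    """Return [dss_line, ass_line] for one intron (1-based, inclusive coordinates).

    Hypothetical helper for illustration only.
    """
    if strand == "+":
        # Forward strand: the donor is the lowest intron coordinate.
        dss, ass = intron_start, intron_end
    elif strand == "-":
        # Reverse strand: transcript direction is reversed, so the donor
        # sits at the highest intron coordinate.
        dss, ass = intron_end, intron_start
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")

    lines = []
    for feature, pos in (("dss", dss), ("ass", ass)):
        fields = [
            seqid,
            "gff2hints_sketch",  # assumed source column, free text
            feature,
            str(pos),
            str(pos),
            ".",                 # score
            strand,
            ".",                 # frame
            f"src={src}",        # hint source tag (assumption: 'M')
        ]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    for line in splice_site_hints("chr1", 101, 200, "-"):
        print(line)
```

For an approximately known splice site, the same line layout could carry an interval instead of a single position, as the descriptions of `ass` and `dss` allow.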

photocyte avatar Apr 08 '24 15:04 photocyte

I worked on that a while ago but forgot to push this work in progress. You can find it in the Augustus branch. The script is called agat_sp_create_augustus_hints.pl.

Start and stop should be in the file by default; if not, you should use the agat_sp_add_start_and_stop.pl script. I should then add a function to replace the start_codon and stop_codon feature types with start and stop, respectively.
For UTRs, I should add a function to detect synonyms and call all of them UTR.
For irpart, the script currently calls them intergenic_region. This should be renamed.
nonexonpart should be a copy of irpart and intronpart.
intronpart should be a copy of the intron feature?
exonpart should be a copy of the exon feature? There is some other stuff to decipher: UTRpart, CDSpart, ...

Juke34 avatar Apr 08 '24 19:04 Juke34
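
The renaming steps listed in the comment above (start_codon to start, stop_codon to stop, UTR synonyms to UTR, intergenic_region to irpart) can be sketched in Python as a lookup on the third GFF column. This is only a sketch under assumptions: the function name and the mapping table are hypothetical, the list of UTR synonyms is illustrative and incomplete, and it does not reflect the actual code of agat_sp_create_augustus_hints.pl.

```python
# Assumed mapping from lower-cased GFF feature types to Augustus hint types.
HINT_TYPE_BY_FEATURE = {
    "start_codon": "start",
    "stop_codon": "stop",
    "five_prime_utr": "UTR",
    "three_prime_utr": "UTR",
    "utr": "UTR",
    "intergenic_region": "irpart",
}


def rename_feature(gff_line):
    """Rename the feature type of one GFF line; other lines pass through."""
    line = gff_line.rstrip("\n")
    # Leave comments, directives, and blank lines untouched.
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")
    # Leave anything that is not a 9-column feature line untouched.
    if len(fields) != 9:
        return line
    # Case-insensitive lookup; unknown types (exon, intron, CDS, ...) are kept.
    fields[2] = HINT_TYPE_BY_FEATURE.get(fields[2].lower(), fields[2])
    return "\t".join(fields)


if __name__ == "__main__":
    example = "chr1\tAGAT\tfive_prime_UTR\t1\t50\t.\t+\t.\tID=utr1"
    print(rename_feature(example))
```

Feature types that already match a hint name (exon, intron, CDS) are left as they are, so the remaining "part" hints would still need separate logic.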