AGAT icon indicating copy to clipboard operation
AGAT copied to clipboard

AGAT stopped at OmniscientI stage

Open ardy20 opened this issue 1 year ago • 4 comments

Hello

I am trying to push results of Interproscan (tsv) and diamondp (protein) search (like blastp) into my braker.gtf file (from Braker2 structural annotation). The main reason is completing functional annotation. But AGAT stops at OmniscientI stage with no error.

The database is too large: uni.parc.fasta

The script that I run is:

agat_sp_manage_functional_annotation.pl -f braker.gtf -b matches.m6 --db uniparc_active.fasta -i augustus.intpro.aa.fasta.tsv --output agustus_final.interpro

I also noticed that I get error:

41492 warning messages: gff3 reader error level2: No ID attribute found

I checked my gtf file and looks that it has ID attribute. Please see a few lines of gtf below:

chr10 AUGUSTUS stop_codon 883140 883142 . - 0 transcript_id "file_1_file_1_g10730.t1"; gene_id "file_1_file_1_g10730"; chr10 AUGUSTUS CDS 883140 883326 0.76 - 1 transcript_id "file_1_file_1_g10730.t1"; gene_id "file_1_file_1_g10730"; chr10 AUGUSTUS exon 883140 883326 . - . transcript_id "file_1_file_1_g10730.t1"; gene_id "file_1_file_1_g10730"; chr10 AUGUSTUS intron 883327 883416 1 - . transcript_id "file_1_file_1_g10730.t1"; gene_id "file_1_file_1_g10730"; chr10 AUGUSTUS CDS 883417 884917 1 - 2 transcript_id "file_1_file_1_g10730.t1"; gene_id "file_1_file_1_g10730"; chr10 AUGUSTUS exon 883417 884917 . - . transcript_id "file_1_file_1_g10730.t1"; gene_id "file_1_file_1_g10730"; chr10 AUGUSTUS intron 884918 888069 1 - . transcript_id "file_1_file_1_g10730.t1"; gene_id "file_1_file_1_g10730"; chr10 AUGUSTUS CDS 888070 888346 0.8 - 0 transcript_id "file_1_file_1_g10730.t1"; gene_id "file_1_file_1_g10730";

Now I am really baffled and cannot recognise the issue. Could this be because of problem in gtf file or my script?

The below is the final report from AGAT but it does not proceed through and it looks that nothing is happening:

------------------------------ done in 1 seconds -------------------------------


  •                            - End checks -                                *
    
  •                          done in 54 seconds                              *
    

=> OmniscientI total time: 239 seconds [19:28:45] Parsing Finished [19:28:45] Compute statistics [19:29:12] Look at the fasta database

ardy20 avatar Sep 10 '22 00:09 ardy20

1st column of the matches.m6 should be mRNA identifiers. I guess it is not your case (You might show file samples to check). The standard workflow will:

  • first standardize your braker.gtf with agat_convert_gxf2gxf.pl,
  • then extract the protein using agat_sp_extract_sequences.pl (the fasta identifier here will be the mRNA ID),
  • then perform the diamondp,
  • and finaly use the diamondp output to feed agat_sp_manage_functional_annotation.pl.

This might be of interest: https://nbisweden.github.io/workshop-genome_annotation_elixir/labs/functional_annotation

Juke34 avatar Sep 14 '22 07:09 Juke34

Hello Jacques

Thanks for the comment. Below I provide the first 10 lines of my matches.m6 and braker.gtf file, respectively:

g5517.t1 UPI001C4E852D 94.7 94 5 0 1 94 1 94 1.13e-60 190 g5517.t1 UPI0008FBE5BE 93.6 94 6 0 1 94 1 94 8.39e-57 182 g5517.t1 UPI000E701874 92.6 94 7 0 1 94 1 94 9.84e-57 182 g5517.t1 UPI001C4F0E59 93.6 94 6 0 1 94 1 94 1.19e-56 182 g5517.t1 UPI00052EA367 93.6 94 6 0 1 94 1 94 1.19e-56 182 g5517.t1 UPI000E6FEB2C 92.6 94 7 0 1 94 1 94 1.69e-56 182 g5517.t1 UPI001CC6F236 92.6 94 7 0 1 94 1 94 2.40e-56 182 g5517.t1 UPI001A0B771D 92.6 94 7 0 1 94 1 94 6.86e-56 181 g5517.t1 UPI0019F464ED 91.5 94 8 0 1 94 1 94 1.38e-55 180 g5517.t1 UPI000E70091F 90.4 94 9 0 1 94 1 94 2.10e-55 179

############################################################## gtf file

chr2 AUGUSTUS gene 1 613 0.19 - . g1 chr2 AUGUSTUS transcript 1 613 0.19 - . g1.t1 chr2 AUGUSTUS intron 1 23 0.66 - . transcript_id "g1.t1"; gene_id "g1"; chr2 AUGUSTUS CDS 24 367 0.4 - 1 transcript_id "g1.t1"; gene_id "g1"; chr2 AUGUSTUS exon 24 367 . - . transcript_id "g1.t1"; gene_id "g1"; chr2 AUGUSTUS intron 368 476 0.29 - . transcript_id "g1.t1"; gene_id "g1"; chr2 AUGUSTUS CDS 477 613 0.26 - 0 transcript_id "g1.t1"; gene_id "g1"; chr2 AUGUSTUS exon 477 613 . - . transcript_id "g1.t1"; gene_id "g1"; chr2 AUGUSTUS start_codon 611 613 . - 0 transcript_id "g1.t1"; gene_id "g1"; chr2 AUGUSTUS gene 12012 21215 0.21 - . g2

I also will try your recommended script and come back to you.

ardy20 avatar Sep 15 '22 01:09 ardy20

Hello Jacques Which script should be used for protein extract and what the input file should be in: agat_sp_extract_sequences.pl

Should this be: agat_sp_extract_sequences.pl -g infile.gff -f infile.fasta -t exon --merge

What is the infile.fasta?

ardy20 avatar Sep 15 '22 05:09 ardy20

You can have a look at the help agat_sp_extract_sequences.pl -h, or here https://www.biostars.org/p/9465973/ and here https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html

Proteins are made of CDS not exons...

In short it will be:

agat_sp_extract_sequences.pl -g infile.gff -f infile.fasta -p

Juke34 avatar Sep 15 '22 08:09 Juke34

Dear Developer

I followed your guidelines and this time AGAT worked perfectly. Thanks for this useful toolbox. One more question, Is there any tool that can polish the gft file for submission to NCBI?

ardy20 avatar Oct 05 '22 05:10 ardy20

I guess you can process the GTF with AGAT (gxf2gxf script) and it should be standardize to be submitted as a GFF file.

Juke34 avatar Oct 06 '22 11:10 Juke34

Can we use more threads for a speedy process?

I am trying to merge the blast results to my gff file that I already combined the interproscan inside.

However, the following command could not be completed using 24 Cpus, 120 Gb Ram even after 72 hours:

agat_sp_manage_functional_annotation.pl -f augustus.standard.gtf -b matches.m6 --db uniparc_active.fasta -o final.interpro.blast

Is this a right command? It stops (with no error) at:

=> OmniscientI total time: 201 seconds [12:37:20] Parsing Finished [12:37:20] Compute statistics [12:37:44] Look at the fasta database

ardy20 avatar Oct 10 '22 07:10 ardy20

Hi, AGAT is not multithreaded, it is useless to provide more than 1 CPU. (FYI I had plan to update the code to multithread the parsing but I'm do not have time to spend to develep this type of feature for now.). The longest part is ofthen the parsing, here 201 seconds, so the rest should not be such slow. Probably the problem is elsewhere. Which AGAT version are you using? What version of bioperl? You could try to remove the database created automatically in the working folder, it could be corrupted (what is the size of the created database?)

Juke34 avatar Oct 10 '22 08:10 Juke34

Hi

OK, Thanks.

I installed via conda and then updated agat to version 9. The data base that I use in the following script is uniprot.fasta protein file directly downloaded from the uniprot database. Should I create a database first using blastp or diamond?

I used this script: agat_sp_manage_functional_annotation.pl -f augustus.standard.gtf -b matches.m6 --db uniparc_active.fasta -o final.interpro.blast

should I use this instead?

agat_sp_manage_functional_annotation.pl -f infile.gff [ -b blast_infile --db uniprot.fasta -i interpro_infile.tsv --id ABCDEF --output outfile ]

ardy20 avatar Oct 10 '22 08:10 ardy20

Your command sounds good. Is uniparc_active.fasta the db you use to create the matches.m6 file? I had an issue on OSX when they updated the HDD system, the DB created (index file) was completely broken and was several GBs size instead of Ko... it has been fixed after few updates. Could you show you conda configuration? conda list and tell me if you are using an OSX or Linux?

Juke34 avatar Oct 10 '22 08:10 Juke34

Yes, The uniparc_active.fasta is the uniprot fasta file that I downlaided from uniprot database. Then I indexed it in another folder using diamondp then I used diamond (comparable to blastp) to search (blastp) database.

Should the indexed database be kept in the same working directory of agat?

I use Linux not OSX.

Please also see the following:

conda list

packages in environment at /home/user/miniconda3:

Name Version Build Channel

_libgcc_mutex 0.1 main _openmp_mutex 4.5 1_gnu brotlipy 0.7.0 py37h27cfd23_1003 ca-certificates 2021.7.5 h06a4308_1 certifi 2021.5.30 py37h06a4308_0 cffi 1.14.6 py37h400218f_0 chardet 4.0.0 py37h06a4308_1003 conda 4.10.3 py37h06a4308_0 conda-package-handling 1.7.3 py37h27cfd23_1 cryptography 3.4.7 py37hd23ed53_0 idna 2.10 pyhd3eb1b0_0 ld_impl_linux-64 2.35.1 h7274673_9 libffi 3.3 he6710b0_2 libgcc-ng 9.3.0 h5101ec6_17 libgomp 9.3.0 h5101ec6_17 libstdcxx-ng 9.3.0 hd4cf53a_17 ncurses 6.2 he6710b0_1 openssl 1.1.1k h27cfd23_0 pip 21.1.3 py37h06a4308_0 pycosat 0.6.3 py37h27cfd23_0 pycparser 2.20 py_2 pyopenssl 20.0.1 pyhd3eb1b0_1 pysocks 1.7.1 py37_1 python 3.7.10 h12debd9_4 readline 8.1 h27cfd23_0 requests 2.25.1 pyhd3eb1b0_0 ruamel_yaml 0.15.100 py37h27cfd23_0 setuptools 52.0.0 py37h06a4308_0 six 1.16.0 pyhd3eb1b0_0 sqlite 3.36.0 hc218d9a_0 tk 8.6.10 hbc83047_0 tqdm 4.61.2 pyhd3eb1b0_1 urllib3 1.26.6 pyhd3eb1b0_1 wheel 0.36.2 pyhd3eb1b0_0 xz 5.2.5 h7b6447c_0 yaml 0.2.5 h7b6447c_0 zlib 1.2.11 h7b6447c_3

ardy20 avatar Oct 10 '22 09:10 ardy20

Should the indexed database be kept in the same working directory of agat?

No AGAT should index in its own way the fast file.

Juke34 avatar Oct 22 '22 10:10 Juke34