Format .gb for training

Open kimnegrette3 opened this issue 4 years ago • 2 comments

Hi! I want to train AUGUSTUS using a .gbff file I downloaded from NCBI: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/611/645/GCF_000611645.1_mono_v1/GCF_000611645.1_mono_v1_genomic.gbff.gz However, the command: randomSplit.pl Monoraphidium_neglectum_genomic.gbff 100 fails with "size 100 is greater than the number of genes in file Monoraphidium_neglectum_genomic.gbff. Aborting." The file certainly contains more than 100 genes, so it seems the format is not quite right. What exactly should I change in the file? I would really appreciate any help. Thanks!

Kimberly.

kimnegrette3, May 19 '20 19:05

You can download the same annotation in GFF3 format, along with the genome sequence. (You may need to simplify the sequence names in both files.) Use these two files to generate the GenBank file for training AUGUSTUS. Please do not use all genes; 2000 to 10000 genes are sufficient.

Our tools don't work with NCBI's gbff format.

Katharina

KatharinaHoff, May 19 '20 19:05
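
The suggestion above to simplify sequence names in both files could be sketched roughly as follows. This is only an illustration, under the assumption that "simplify" means cutting each FASTA header down to its first whitespace-delimited token so that it matches column 1 of the GFF3 file. The function names and the sample accession used in testing are hypothetical and not part of the AUGUSTUS scripts.

```python
from typing import Iterable, Iterator, Set


def simplify_fasta_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each header cut to its first token.

    Assumption: the part of the header before the first whitespace is the
    sequence identifier that the GFF3 file uses in its first column.
    """
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            yield ">" + name + "\n"
        else:
            # Sequence lines are passed through unchanged.
            yield line


def gff3_seqids(lines: Iterable[str]) -> Set[str]:
    """Collect the sequence names found in column 1 of a GFF3 file."""
    seqids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            # An embedded FASTA section, if present, ends the feature table.
            break
        if line.startswith("#") or not line.strip():
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids
```

After rewriting the FASTA file, comparing the set of header names with the result of gff3_seqids is a quick way to confirm that both files refer to the same sequence names before building the training GenBank file.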

I had the same issue with the gb format. It comes from the fact that randomSplit.pl searches for "LOCUS" as the gene tag, whereas in the GenBank format the tag is given by "gene". I solved it by downloading the GFF and the genome FASTA and then running: gff2gbSmallDNA.pl genome.gff3 genome.fasta 100 genome_augustus.gb (use more than 100 for eukaryotes!). After that, randomSplit.pl worked.

lalalagartija, Mar 13 '24 10:03
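
Following the explanation in the last comment, a small diagnostic like the one below can show how many "LOCUS" records and how many "gene" features a GenBank file contains. It is a minimal sketch, not part of AUGUSTUS: the function name is hypothetical, and it assumes the standard GenBank flat-file layout in which feature keys start in column 6.

```python
from typing import Iterable, Tuple


def count_locus_and_genes(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of LOCUS records, number of gene features).

    Assumption: feature keys are indented by exactly five spaces, as in the
    standard GenBank flat-file layout, while qualifier lines are indented
    further and therefore do not match.
    """
    locus_count = 0
    gene_count = 0
    for line in lines:
        if line.startswith("LOCUS"):
            locus_count += 1
        elif line.startswith("     gene "):
            gene_count += 1
    return locus_count, gene_count
```

The function can be applied to an open file handle, for example with open("genome_augustus.gb") as handle: print(count_locus_and_genes(handle)). If the first number is smaller than the size passed to randomSplit.pl, that would be consistent with the error message quoted in the original question.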