bedtools2 icon indicating copy to clipboard operation
bedtools2 copied to clipboard

bedtools intersect fails when GFF file contains a ##FASTA section

Open sjackman opened this issue 10 years ago • 12 comments

❯❯❯ bedtools intersect -v -a a.gff -b b.gff
Error: line number 11752 of file a.gff has 1 fields, but 9 were expected.

a.gff

…
38  prokka  gene    981 1121    .   -   1   ID=OU3MT_05875_gene;locus_tag=OU3MT_05875
##FASTA
>1
TTACAAGCTTGAAAGAATGAATGACATCCTTTTTCTTCCCTCAATGCTATAGAAATAAGG
…

sjackman avatar May 13 '15 00:05 sjackman

Hmm, is this standard/expected? I have frankly never seen that before.

arq5x avatar May 13 '15 00:05 arq5x

I believe that ##FASTA is standard. That's the format output by Prokka. @tseemann

sjackman avatar May 13 '15 00:05 sjackman

You learn something new every day. Is it one entire section of ##FASTA or are such sections interleaved throughout the file?

arq5x avatar May 13 '15 00:05 arq5x

I think it's just one ##FASTA section at the end of the file, but I don't know the for-certain answer to that question. Perhaps @tseemann knows.

sjackman avatar May 13 '15 05:05 sjackman

Hi, It's still odd that it failed, as the ##FASTA should have been recognized as a header line, added to the header, and skipped, even if encountered after lines of actual data. I'll look into this. Thanks for reporting it.

nkindlon avatar May 13 '15 12:05 nkindlon

Hi Neil, The issue is the actual FASTA sequence line thereafter.

arq5x avatar May 13 '15 12:05 arq5x

Oooooh. Sorry, didn't read carefully enough. Well I guess we could rig it to skip that for GFF files, now that we know this is possible.

nkindlon avatar May 13 '15 13:05 nkindlon

Here's the reference. Search for ##FASTA. http://www.sequenceontology.org/gff3.shtml

##FASTA

This notation indicates that the annotation portion of the file is at an end and that the remainder of the file contains one or more sequences (nucleotide or protein) in FASTA format. This allows features and sequences to be bundled together. All FASTA sequences included in the file must be included together at the end of the file and may not be interspersed with the features lines. Once a ##FASTA section is encountered no other content beyond valid FASTA sequence is allowed.

sjackman avatar May 13 '15 17:05 sjackman

Thanks Shaun. We won't be able to get a fix for this into the forthcoming release, but we will try to tackle it for the subsequent release. Thanks for pointing this out!

arq5x avatar May 18 '15 02:05 arq5x

No worries. It's easy enough to remove the ##FASTA section with sed '/^##FASTA$/,$d'

sjackman avatar May 20 '15 06:05 sjackman

It's not 100% clear from the GFF3 spec if you can not have the ##FASTA section and still put >fasta sequences at the end anyway. I've seen files like that. And BioPerl allows it when parsing in.

But it is true that have to be at the end only. I actually think it would be nice to allow interleaving. Sort of a "literate annotation" system with the isoforms and genes etc within the annotation. I once had a colleague who hand-annotated the whole malaria genome by inserting markup and spacing etc within the genome FASTA file. it was machine parseable - I converted it to Genbank at the time!

You've probably never seen ##FASTA because it doesn't make sense for massive genomes with long term references. But for bacteria, the genomes are equally frequent as the annotations!

tseemann avatar Jun 14 '15 01:06 tseemann

This problem seems to still exist and the suggested workaround doesn't actually work (you need to remove everything after and including the ##FASTA line).

Here's my workaround

#!/bin/bash
#
# Script finds first line that begins with `##FASTA` and removes that and all following lines.
#
# Background: Prokka generates GFF files with FASTA sequences at the end.
#   The program `bedtools` doesn't like this and throws an error.
#

INFILE=$1;
OUTFILE=${INFILE/.gff/_no-fasta.gff};
SEARCH_TERM="^##FASTA"

# Find line number using grep, remove additional info returned by grep and then reduce by 1
# Line number code taken from: https://stackoverflow.com/a/22020562/5322644
LINE_NUM=$(( $(grep -n -m 1 $SEARCH_TERM $INFILE |sed  's/\([0-9]*\).*/\1/') - 1))
if [ $INFILE != $OUTFILE ]; then
   head -n $LINE_NUM "$INFILE" > "$OUTFILE";
fi

echo "Successfully removed fasta info from $INFILE and saved to $OUTFILE."

I'm amazed more folks don't encounter this problem.

mikegilchrist avatar Jun 05 '23 20:06 mikegilchrist