metaeuk icon indicating copy to clipboard operation
metaeuk copied to clipboard

The gff format is wrong?

Open xiekunwhy opened this issue 2 years ago • 14 comments

Expected Behavior

Out put a correct(common) gff file like here https://m.ensembl.org/info/website/upload/gff3.html.

Current Behavior

  1. The 2nd column of gff file is annotation source, and the 3rd column is feature type, so I think the gff format of metaeuk results is wrong.
  2. There are repeat id(s) in the 9th column, it's better to make some modification to avoid because many gff tools do not accept duplicate gene ID.

Best, Kun

xiekunwhy avatar Nov 13 '21 09:11 xiekunwhy

Thank you very much for pointing out both issues. Indeed, the order was wrong and is now fixed (2e4ecd3). As for the identifiers, the exon number was added to exon/CDS lines to create a unique identifier. The following should be unique: For a gene: TCS_ID=protein_acc|contig_acc|strand For an mRNA: TCS_ID=protein_acc|contig_acc|strand_mRNA For an exon: TCS_ID=protein_acc|contig_acc|strand_exon_id For a CDS: TCS_ID=protein_acc|contig_acc|strand_CDS_id

elileka avatar Nov 13 '21 20:11 elileka

Thank you for your reply, and I have an other question. Is it possible to keep one hit(maybe the best one) per target protein? Best, Kun

xiekunwhy avatar Nov 27 '21 05:11 xiekunwhy

Let me see if I understand your question. Say you have a target protein T and three contigs C1, C2 and C3. MetaEuk will report up to six^ results for T, one for each contig and strand. You would like to see the best result over ALL contigs?

If so, the answer is that the easiest way to obtain that is to write a script that filters the results, based on the E-value or bitscore. A simple awk (or any other language) can do that (we can assist with the command, if needed).

Assuming you are looking at the GFF output, then you will need to parse the last field and for all lines that have the same Target_ID=protein_acc and type 'gene' and keep the line with the highest bitscore (sixth field).

If I didn't get your question, please let me know.

^comment: We have recently extended MetaEuk to allow finding more than one hit per contig and strand, but that's a different topic and the default setting is still the same as described above.

elileka avatar Nov 27 '21 16:11 elileka

Thank you for your reply, and I was succeeded in dealing with that finaly. An other problem (not so serious) about gff file is that many many many downstream tools(like gffread) recognize "ID=(*);" strictly(but the metaeuk gff is "Target_ID=(*);") and do not allow duplicate "ID=(*);" including gene feature lines. But I can handle this problem myself. And I reformat the original gff like following.

image

Best, Kun

xiekunwhy avatar Dec 01 '21 07:12 xiekunwhy

Thank you very much for this feedback! If you take TCS_ID from each GFF line, you should get a unique identifier (now also including the low_coord). Does it make more sense to put this value before Target_ID in each line? Your solution looks good too, of course.

elileka avatar Dec 01 '21 14:12 elileka

When I use gffread, I got duplicate mRNA errors, sed 's/Target_ID/ID/g' uniref90.gff > uniref90.ID.gff gffread uniref90.ID.gff -g ../Op-f.pctg.mask.fa -x uniref90.ID.cds.fa -y uniref90.ID.pep.fa GFF Error: duplicate/invalid 'mRNA' feature ID=Ofa004732

And Ofa004732 appears many times like following, so it's not a unique identifier. put TCS_ID value before Target_ID in each line is a good idea. image

Best, Kun

xiekunwhy avatar Dec 03 '21 02:12 xiekunwhy

Hi Kun,

The output you sent is in agreement with what I wrote. The identifier after "Target_ID" can repeat several times.

The unique string for each gff line follows the TCS_ID. So in your example: "Ofa004732|scaffold_52|+|9934" and "Ofa004732|scaffold_52|+|9934_mRNA" and "Ofa004732|scaffold_52|+|9934_exon_0" etc...

elileka avatar Dec 03 '21 09:12 elileka

Hi,

I see what you mean, thank you for your reply.

Best, Kun

xiekunwhy avatar Dec 06 '21 01:12 xiekunwhy

I've been looking into the details for the gff file and I'm a bit confused as well:

Here's the gene identifiers:

AYO42509.1|NODE_20_length_120068_cov_12.604801|+|823|3.852e-238|10|194|2282|194[194]:265[265]:72[72]|311[314]:358[358]:48[45]|406[412]:498[498]:93[87]|516[537]:593[593]:78[57]|628[628]:1002[1002]:375[375]|1129[1129]:1254[1254]:126[126]|1272[1287]:1481[1481]:210[195]|1695[1695]:1877[1877]:183[183]|1904[1907]:2122[2122]:219[216]|2139[2157]:2282[2282]:144[126]

Here's the gff records for that gene:

NODE_20_length_120068_cov_12.604801	gene	MetaEuk	195	2283	823	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+
NODE_20_length_120068_cov_12.604801	mRNA	MetaEuk	195	2283	823	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	195	266	50	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	195	266	50	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	312	359	33	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	315	359	33	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	407	499	69	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	413	499	69	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	517	594	42	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	538	594	42	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	629	1003	199	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	629	1003	199	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	1130	1255	81	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	1130	1255	81	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	1273	1482	100	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	1288	1482	100	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	1696	1878	100	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	1696	1878	100	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	1905	2123	132	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	1908	2123	132	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon
NODE_20_length_120068_cov_12.604801	exon	MetaEuk	2140	2283	44	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_mRNA
NODE_20_length_120068_cov_12.604801	CDS	MetaEuk	2158	2283	44	+	.	Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon

I've parsed out the exons:

 '194[194]:265[265]:72[72]',
 '311[314]:358[358]:48[45]',
 '406[412]:498[498]:93[87]',
 '516[537]:593[593]:78[57]',
 '628[628]:1002[1002]:375[375]',
 '1129[1129]:1254[1254]:126[126]',
 '1272[1287]:1481[1481]:210[195]',
 '1695[1695]:1877[1877]:183[183]',
 '1904[1907]:2122[2122]:219[216]',
 '2139[2157]:2282[2282]:144[126]'

I'm reading this section of your docs: https://github.com/soedinglab/metaeuk/#the-metaeuk-header

>T_acc|C_acc|S|bitscore|E-Value|number_exons|low_coord|high_coord|exon1_coords|exon2_coords|...

coord refers to the coordination on the contig (first base has coordinate 0). It is advisable to keep T_acc and C_acc short and without pipes. The exon_coords are of the structure: low[taken_low]:high[taken_high]:nucleotide_length[taken_nucleotide_length]

Since MetaEuk allows for a very short overlap on T of two putative exons (see P2 and P3 in the illustration below), when joining the sequences of the exons, one of them is shortened. The coordinates of the codons taken from this exon will be in square brackets ([taken_low], [taken_high] and [taken_nucleotide_length]). These refer to the orange section of P3 below, while the coordinates outside the brackets refer to the yellow+orange section of P3.

The exon positions make sense since you just add 1 to offset. What I don't understand are the following:

  • Why are there multiple CDS?
  • Why do the CDS use the inner brackets?
  • What is more useful for mapping and quantifying gene counts? The inner or outer brackets or some combination of both?

For example, NODE_20_length_120068_cov_12.604801 CDS MetaEuk 1288 1482 100 + . Target_ID=AYO42509.1;TCS_ID=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_CDS;Parent=AYO42509.1|NODE_20_length_120068_cov_12.604801|+_exon

It's taking the part on the inner bracket on the start?

Most GFF records I see only have one CDS per gene. Could I just skip these CDS records or would that be incorrect?

jolespin avatar Mar 04 '22 00:03 jolespin

Hello,

As we disclaim in the README, the GFF format is somewhat forced on MetaEuk as it does not predict introns or UTRs, etc. As you can read in the README, two consecutive exons can potentially match the same (very short) part on the target. In such a case it doesn't make sense to duplicate the short segment they share so we shorten one of the exons. We chose to report this information in the CDS lines. So, each CDS line will be identical to its corresponding exon line unless there was such a shortening, in which case the CDS will report the shortened version. I personally think it is therefore better to take the CDS lines for each gene.

elileka avatar Mar 04 '22 19:03 elileka

Hi, I'm trying to use metaeuk output gff to make a tx2gene file using makeTxDbFromGFF for later use in tximport. The gff format though is all wrong and I don't know how to fix it. Does it seem possible to use my metaeuk-generated gff for this purpose?

I changed "Target_ID" to "ID" like Kun did, since makeTxDbFromGFF needs this, but it still gives me lots of errors: "NA" columns, "incompatible seqnames" and a lot of "non-unique" columns/values. Would it be possible to manually simplify the gff file to fit the format required by makeTxDbFromGFF? I'm new to all this and quite lost, can someone explain it to me like I was an especially dumb toddler, please?

Thanks a bunch! E

ereyred avatar Mar 28 '22 03:03 ereyred

Hi,

Could you please let me know which version you're using? I have fixed some things concerning the GFF format following the above comments. We have NOT released it yet but you can find it in the latest precompiled files: https://mmseqs.com/metaeuk/

elileka avatar Mar 29 '22 19:03 elileka

Hi Eli, I'm using static Linux AVX2 build. Says metaeuk Version: 9818d1a5b155c28b3ef11bfa9b7c69073e669a70 Should I download and use another?

ereyred avatar Mar 30 '22 04:03 ereyred

In case it's helpful, I've developed a MetaEuk gff modifier that produces an output that is a bit easier to interpret and use with existing tools. In particular, it can be concatenated with Prodigal's GFF file (used for prokaryotic gene modeling) and this concatenated GFF file can be used with featureCounts. It was released with the following package/publication which uses/cites MetaEuk:

Espinoza, J.L., Dupont, C.L. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics 23, 419 (2022). https://doi.org/10.1186/s12859-022-04973-8

Here's the script: https://github.com/jolespin/veba/blob/main/src/scripts/compile_metaeuk_identifiers.py

Here's an example of some of the output:

  • gene_models.gff image

SRR17458614__METABAT2__E.1__bin.2.gff.gz

  • identifier_mapping.metaeuk.tsv.gz image

identifier_mapping.metaeuk.tsv.gz

It retains the original identifiers as well but the long identifiers with | characters can sometimes be problematic in some pipelines.

Hope this helps.

jolespin avatar Nov 04 '22 17:11 jolespin