
problem with DNA sequence detection error

Open skeffington opened this issue 5 years ago • 7 comments

Hi David,

Thanks for this amazing tool!

I was just trying to run OrthoFinder (newest version) on some of the proteins from the EukProt database. These are often derived from transcriptomes, so they contain some unusual sequences. On running OrthoFinder I got the ".fasta appears to contain nucleotide sequences ..." error message. There were no DNA sequences in the protein FASTA files. However, I wrote a script to find the sequences with an [ATGCatcg] content >80% and removed those, after which OrthoFinder ran without problems. So I think possible improvements are as follows:

  1. The error message should state which files are affected, rather than just ".fasta ...".
  2. Presumably there's an internal limit for proportion of [ATGCatcg] content. It seems to be less than 0.99. Why not make the requirement simply that it cannot be equal to one?
  3. An option to discount affected sequences if they are found? Or at least a script to allow users to remove those sequences?
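
A minimal sketch of the kind of filtering script described in this post. The original script was not shared, so everything here is an assumption apart from the [ATGCatcg] character class and the 80% cut-off: the function names are hypothetical, a trailing `*` stop symbol is ignored, and the FASTA parser is deliberately simple.

```python
from typing import Iterable, Iterator, List, Tuple

NUCLEOTIDES = set("ACGTacgt")  # same character class as [ATGCatcg] in the post


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def acgt_fraction(seq: str) -> float:
    """Fraction of residues that are A, C, G or T (case-insensitive)."""
    residues = seq.rstrip("*")  # assumption: ignore a trailing stop symbol
    if not residues:
        return 0.0
    return sum(ch in NUCLEOTIDES for ch in residues) / len(residues)


def split_by_acgt(records: Iterable[Tuple[str, str]], threshold: float = 0.8):
    """Split records into (kept, dropped) using the ACGT-fraction threshold."""
    kept: List[Tuple[str, str]] = []
    dropped: List[Tuple[str, str]] = []
    for header, seq in records:
        target = dropped if acgt_fraction(seq) > threshold else kept
        target.append((header, seq))
    return kept, dropped


def filter_fasta_file(in_path: str, out_path: str, threshold: float = 0.8) -> List[str]:
    """Write the kept sequences to out_path; return headers of dropped ones."""
    with open(in_path) as fin:
        kept, dropped = split_by_acgt(read_fasta(fin), threshold)
    with open(out_path, "w") as fout:
        for header, seq in kept:
            fout.write(f">{header}\n{seq}\n")
    return [header for header, _ in dropped]
```

Calling `filter_fasta_file("proteome.fasta", "proteome.filtered.fasta")` (both file names are placeholders) would write the retained sequences and return the headers of the removed ones, which also gives a list of affected sequences to inspect.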

Thanks! Alastair

skeffington avatar Jul 16 '20 15:07 skeffington

Hi Alastair

Thanks, they're good suggestions, I'll try implementing them in the new version. From what I remember I think OrthoFinder only checked the first 100 sequences for performance reasons, but this might be an over-optimisation. Do you have a link to one of the files where this is a problem so I can investigate?
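
To illustrate why checking only the first sequences of a file can give a different answer from checking the whole file, here is a small sketch. It is not OrthoFinder's code: the function name, the pooling of residues across sequences, and the use of the ACGT fraction as the measure are all assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Optional


def pooled_acgt_fraction(seqs: Iterable[str], first_n: Optional[int] = None) -> float:
    """ACGT fraction pooled over the first `first_n` sequences (all if None)."""
    acgt = total = 0
    # islice(seqs, None) iterates over every sequence
    for seq in islice(seqs, first_n):
        seq = seq.upper().rstrip("*")  # assumption: ignore a trailing stop symbol
        acgt += sum(seq.count(base) for base in "ACGT")
        total += len(seq)
    return acgt / total if total else 0.0


# Hypothetical file content: two nucleotide-like entries at the top,
# followed by ordinary protein sequences.
sequences = ["ACGTAC", "ACGTAC"] + ["MKLVPEFQ"] * 8

print("first 2 sequences:", pooled_acgt_fraction(sequences, first_n=2))
print("whole file:       ", pooled_acgt_fraction(sequences))
```

Comparing the two numbers for a real file shows whether a handful of unusual sequences near the top is enough to dominate a check that only samples the start of the file.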

All the best David

davidemms avatar Jul 17 '20 05:07 davidemms

Hi David,

Here's a link to a proteome file that has three examples of such sequences: https://we.tl/t-fo67Dgyctn

It's proteome number EP00315 from the EukProt proteome set.

Thanks and best wishes, Alastair

skeffington avatar Jul 18 '20 21:07 skeffington

Hi Alastair

Thanks for the example files. I've just got back from holiday and so wasn't able to download them before the link expired. Would you be able to send me a fresh link?

Many thanks David

davidemms avatar Jul 28 '20 10:07 davidemms

Hi David,

We are running OrthoFinder for proteins predicted from fungal genomes and are getting an identical error where it states: G.morbida.faa appears to contain nucleotide sequences instead of amino acid sequences. We have double checked that the file contains only amino acid sequences and have also filtered out proteins with 50% ATCG content just to troubleshoot and are still receiving the error. Do you have any thoughts on why this might be? We have confirmed that OrthoFinder doesn't have this issue for predicted proteins from tree genomes, so it seems to be exclusive to this fungal data set.

Thank you, Aaron

aonufrak avatar Oct 02 '20 15:10 aonufrak

We realized that we had an error during an earlier filtering step: proteins containing E, F, I, L, P, or Q had been removed, because those letters are not nucleotide or nucleotide degeneracy codes. After fixing the filtering error, OrthoFinder performed as expected.

Thank you!

aonufrak avatar Oct 06 '20 23:10 aonufrak

Hello David, I got the following error:

(orthofinder) ubuntu@networkvm1-f775d:/mnt/nwdata/mapo/orthofinder/proteomes$ orthofinder -S diamond -t 14 -a 14 -f /mnt/nwdata/mapo/orthofiinder/test -o /mnt/nwdata/mapo/orthofinder/orthofinder_2021-06

OrthoFinder version 2.5.2 Copyright (C) 2014 David Emms

2021-06-21 16:16:38 : Starting OrthoFinder 2.5.2
14 thread(s) for highly parallel tasks (BLAST searches etc.)
14 thread(s) for OrthoFinder algorithm

Checking required programs are installed
----------------------------------------
Test can run "mcl -h" - ok
Test can run "fastme -i /mnt/nwdata/mapo/orthofinder/orthofinder_2021-06/Results_Jun21/WorkingDirectory/SimpleTest.phy -o /mnt/nwdata/mapo/orthofinder/orthofinder_2021-06/Results_Jun21/WorkingDirectory/SimpleTest.tre" - ok
ERROR: Anthoceros_angustus_header_changed.faa appears to contain nucleotide sequences instead of amino acid sequences. Use '-d' option
ERROR: Atrichopoda_291_v1.0.protein_primaryTranscriptOnly_header_changed.faa appears to contain nucleotide sequences instead of amino acid sequences. Use '-d' option
ERROR: An error occurred, ***please review the error messages*** they may contain useful information about the problem.

Example Sequence from Data:

>AANG012485
VYTSHHSDSASSAVRVSARRWTSKRYVNGDGAKMDKHRKTASHGADTNVYNWNRKARAKRKARANSSDGVGDHRDGGSKKRMDVSSVSVDAVAG*

>AmTr_scaffold00353.4
YYYSSYYKSSSYVYKSSSYVYKSHRRMCTCHHHHRHHMSSHHHHHHHHHTSTSRHHRHHTST*

>Azfi_s1664.g105489
YMVFRASNKHHNCMLDLLGRAGRINEAVAALEQLPYQPDHVTWNTILGKANLMILKENRISITATQHGLQPSHDSQCD*
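
The three example sequences above can be compared with the residues mentioned earlier in the thread (E, F, I, L, P, Q), which are the amino acid letters that are not IUPAC nucleotide or degeneracy codes. The sketch below only reports which of those letters each sequence contains; it does not establish what OrthoFinder's check does internally, and the function name is hypothetical.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# IUPAC nucleotide codes, including U and the degeneracy codes.
IUPAC_NUCLEOTIDE_CODES = set("ACGTURYSWKMBDHVN")

# Amino acid letters that can never be read as a nucleotide code.
PROTEIN_ONLY = AMINO_ACIDS - IUPAC_NUCLEOTIDE_CODES


def protein_only_residues(seq: str) -> set:
    """Return the protein-only letters present in a sequence."""
    return set(seq.upper().rstrip("*")) & PROTEIN_ONLY


examples = {
    "AANG012485": "VYTSHHSDSASSAVRVSARRWTSKRYVNGDGAKMDKHRKTASHGADTNVYNWNRKARAKRKARANSSDGVGDHRDGGSKKRMDVSSVSVDAVAG*",
    "AmTr_scaffold00353.4": "YYYSSYYKSSSYVYKSSSYVYKSHRRMCTCHHHHRHHMSSHHHHHHHHHTSTSRHHRHHTST*",
    "Azfi_s1664.g105489": "YMVFRASNKHHNCMLDLLGRAGRINEAVAALEQLPYQPDHVTWNTILGKANLMILKENRISITATQHGLQPSHDSQCD*",
}

for name, seq in examples.items():
    found = "".join(sorted(protein_only_residues(seq))) or "none"
    print(f"{name}: protein-only letters present = {found}")
```

The first two sequences contain none of the six protein-only letters, so every character in them is also a valid nucleotide or degeneracy code, while the third contains all six.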

I also tried to change nchar to 60. But the error doesn't change. Files are stored here: https://we.tl/t-pjFo2lQEhT

Thank you!

mapoNW avatar Jun 21 '21 17:06 mapoNW

Hello David, I seem to have the same problem. When I launch OrthoFinder, it treats one of the files I use as a nucleotide sequence file rather than an amino acid sequence file:

dogbo@lug:~/documents/orthologie_sur_OR$ ~/programmes/OrthoFinder/orthofinder -f directory_Ortho_Or

OrthoFinder version 2.5.4 Copyright (C) 2014 David Emms

2023-04-24 18:00:56 : Starting OrthoFinder 2.5.4
72 thread(s) for highly parallel tasks (BLAST searches etc.)
9 thread(s) for OrthoFinder algorithm

Checking required programs are installed

Test can run "mcl -h" - ok
Test can run "fastme -i /home/dogbo/documents/orthologie_sur_OR/directory_Ortho_Or/OrthoFinder/Results_Apr24_3/WorkingDirectory/SimpleTest.phy -o /home/dogbo/documents/orthologie_sur_OR/directory_Ortho_Or/OrthoFinder/Results_Apr24_3/WorkingDirectory/SimpleTest.tre" - ok
ERROR: OR_Dmel.faa appears to contain nucleotide sequences instead of amino acid sequences. Use '-d' option
ERROR: An error occurred, please review the error messages they may contain useful information about the problem.

egcem1 avatar Apr 24 '23 16:04 egcem1