OrthoFinder
problem with DNA sequence detection error
Hi David,
Thanks for this amazing tool!
I was just trying to run OrthoFinder (newest version) on some of the proteins from the EukProt database. These are often derived from transcriptomes, so they contain some unusual sequences. On trying to run OrthoFinder I got the ".fasta appears to contain nucleotide sequences ..." error message. There were no DNA sequences in the protein fasta files. However, I wrote a script to find the sequences with an [ATGCatcg] content >80% and removed those. Running OrthoFinder was then OK. So I think possible improvements are as follows:
- The error message should state which files are affected, rather than just ".fasta ...".
- Presumably there's an internal limit for proportion of [ATGCatcg] content. It seems to be less than 0.99. Why not make the requirement simply that it cannot be equal to one?
- An option to discount affected sequences if they are found? Or at least a script to allow users to remove those sequences?
Thanks! Alastair
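For anyone who needs the same workaround, here is a minimal sketch of the kind of filter described above. It is not the original script and not part of OrthoFinder; the function names (`read_fasta`, `acgt_fraction`, `filter_records`) are hypothetical, and the 0.8 threshold is simply the >80% cut-off mentioned in the post. It also assumes a trailing `*` stop character should not count towards the sequence length.

```python
import io

NUCLEOTIDES = set("ACGTacgt")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def acgt_fraction(seq):
    """Fraction of residues that are A, C, G or T (case-insensitive)."""
    residues = seq.rstrip("*")  # assumption: ignore a trailing stop character
    if not residues:
        return 0.0
    return sum(c in NUCLEOTIDES for c in residues) / len(residues)


def filter_records(records, threshold=0.8):
    """Split records into (kept, removed) lists.

    A record is removed when its A/C/G/T fraction is strictly above threshold.
    """
    kept, removed = [], []
    for header, seq in records:
        target = removed if acgt_fraction(seq) > threshold else kept
        target.append((header, seq))
    return kept, removed


# Small in-memory demonstration; a real run would open each proteome file.
example = io.StringIO(">p1\nMKLVEFPQ\n>p2\nATGCATGCAA\n")
kept, removed = filter_records(read_fasta(example))
```

Printing the headers in `removed` for each input file would also show which files and sequences are affected, which is the information the first suggestion asks the error message to include.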
Hi Alastair
Thanks, they're good suggestions, I'll try implementing them in the new version. From what I remember I think OrthoFinder only checked the first 100 sequences for performance reasons, but this might be an over-optimisation. Do you have a link to one of the files where this is a problem so I can investigate?
All the best David
Hi David,
Here's a link to a proteome file that has three examples of such sequences: https://we.tl/t-fo67Dgyctn
It's proteome number EP00315 from the EukProt proteome set.
Thanks and best wishes, Alastair
Hi Alastair
Thanks for the example files, I've just got back from holiday and so wasn't able to download the files before the link expired. Would you be able to send a fresh link for me?
Many thanks David
Hi David,
We are running OrthoFinder for proteins predicted from fungal genomes and are getting an identical error where it states: G.morbida.faa appears to contain nucleotide sequences instead of amino acid sequences. We have double checked that the file contains only amino acid sequences and have also filtered out proteins with 50% ATCG content just to troubleshoot and are still receiving the error. Do you have any thoughts on why this might be? We have confirmed that OrthoFinder doesn't have this issue for predicted proteins from tree genomes, so it seems to be exclusive to this fungal data set.
Thank you, Aaron
We realized that an earlier filtering step had an error: proteins containing E, F, I, L, P, or Q were removed, because those letters are not nucleotides or nucleotide degeneracy codes. After fixing the filtering error, OrthoFinder performed as expected.
Thank you!
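A quick way to spot this kind of over-filtering is to look for protein sequences made up entirely of letters that are also IUPAC nucleotide codes. The sketch below is only an illustration under that assumption: it is not OrthoFinder's own check, and the names `IUPAC_NUCLEOTIDE_CODES`, `non_nucleotide_letters` and `only_nucleotide_codes` are hypothetical. Of the 20 standard amino acid letters, E, F, I, L, P and Q are the only ones that are not IUPAC nucleotide codes, so a proteome in which no sequence contains any of them has probably been filtered incorrectly.

```python
# IUPAC nucleotide alphabet: A, C, G, T, U plus the degeneracy codes.
IUPAC_NUCLEOTIDE_CODES = set("ACGTURYSWKMBDHVN")

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def non_nucleotide_letters(seq):
    """Return the letters in seq that are not IUPAC nucleotide codes."""
    letters = {c for c in seq.upper() if c.isalpha()}
    return letters - IUPAC_NUCLEOTIDE_CODES


def only_nucleotide_codes(seq):
    """True if seq has at least one letter and every letter is a nucleotide code."""
    letters = {c for c in seq.upper() if c.isalpha()}
    return bool(letters) and letters <= IUPAC_NUCLEOTIDE_CODES


# Amino acid letters that can never be mistaken for nucleotide codes.
unambiguous = sorted(AMINO_ACIDS - IUPAC_NUCLEOTIDE_CODES)
```

Here `unambiguous` holds the six letters named above. Running `only_nucleotide_codes` over every sequence in a file and counting how many return `True` gives a rough idea of how nucleotide-like the file looks.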
Hello David, I got the following error:
(orthofinder) ubuntu@networkvm1-f775d:/mnt/nwdata/mapo/orthofinder/proteomes$ orthofinder -S diamond -t 14 -a 14 -f /mnt/nwdata/mapo/orthofiinder/test -o /mnt/nwdata/mapo/orthofinder/orthofinder_2021-06
OrthoFinder version 2.5.2 Copyright (C) 2014 David Emms
2021-06-21 16:16:38 : Starting OrthoFinder 2.5.2
14 thread(s) for highly parallel tasks (BLAST searches etc.)
14 thread(s) for OrthoFinder algorithm
Checking required programs are installed
----------------------------------------
Test can run "mcl -h" - ok
Test can run "fastme -i /mnt/nwdata/mapo/orthofinder/orthofinder_2021-06/Results_Jun21/WorkingDirectory/SimpleTest.phy -o /mnt/nwdata/mapo/orthofinder/orthofinder_2021-06/Results_Jun21/WorkingDirectory/SimpleTest.tre" - ok
ERROR: Anthoceros_angustus_header_changed.faa appears to contain nucleotide sequences instead of amino acid sequences. Use '-d' option
ERROR: Atrichopoda_291_v1.0.protein_primaryTranscriptOnly_header_changed.faa appears to contain nucleotide sequences instead of amino acid sequences. Use '-d' option
ERROR: An error occurred, ***please review the error messages*** they may contain useful information about the problem.
Example Sequence from Data:
>AANG012485
VYTSHHSDSASSAVRVSARRWTSKRYVNGDGAKMDKHRKTASHGADTNVYNWNRKARAKRKARANSSDGVGDHRDGGSKKRMDVSSVSVDAVAG*
>AmTr_scaffold00353.4
YYYSSYYKSSSYVYKSSSYVYKSHRRMCTCHHHHRHHMSSHHHHHHHHHTSTSRHHRHHTST*
>Azfi_s1664.g105489
YMVFRASNKHHNCMLDLLGRAGRINEAVAALEQLPYQPDHVTWNTILGKANLMILKENRISITATQHGLQPSHDSQCD*
I also tried to change nchar to 60. But the error doesn't change. Files are stored here: https://we.tl/t-pjFo2lQEhT
Thank you!
Hello David, I seem to have the same problem: when I launch OrthoFinder, it treats one of the files I use as nucleotide sequences rather than amino acid sequences:
dogbo@lug:~/documents/orthologie_sur_OR$ ~/programmes/OrthoFinder/orthofinder -f directory_Ortho_Or
OrthoFinder version 2.5.4 Copyright (C) 2014 David Emms
2023-04-24 18:00:56 : Starting OrthoFinder 2.5.4
72 thread(s) for highly parallel tasks (BLAST searches etc.)
9 thread(s) for OrthoFinder algorithm
Checking required programs are installed
Test can run "mcl -h" - ok
Test can run "fastme -i /home/dogbo/documents/orthologie_sur_OR/directory_Ortho_Or/OrthoFinder/Results_Apr24_3/WorkingDirectory/SimpleTest.phy -o /home/dogbo/documents/orthologie_sur_OR/directory_Ortho_Or/OrthoFinder/Results_Apr24_3/WorkingDirectory/SimpleTest.tre" - ok
ERROR: OR_Dmel.faa appears to contain nucleotide sequences instead of amino acid sequences. Use '-d' option
ERROR: An error occurred, please review the error messages they may contain useful information about the problem.