Which output file is best?
Hi,
I am slightly confused over the output files and wanted to check which one is best.
As I understand it, the tree is made and including an outgroup improves results.
Hierarchical orthogroups (HOGs) are then inferred at each level of the tree.
But are the HOGs output in N0.tsv, which presumably includes all species used, always the best?
Or, if you want to compare two particular species for genes that are present, absent or different between them, are the files for the node containing those two species (and no higher) better? For example, if Species A and Species B, along with others, descend from node 2, would N2.tsv be better than N0.tsv, which might also include other branches containing Species E and Species F?
If there is a difference (i.e. it is not just post-analysis filtering of the same information), are the lower nodes (N1.tsv to Nn.tsv) more specific and more sensitive, and do they still include the information gained from using all species, just refined further?
Many thanks, Jenni
Just to update the above: I pulled out the genes at N0, N2 and N6. N6 contained my species, N2 contained my species together with another species of interest, and N0 included the outgroup species.
I found that I had more genes in N0.tsv compared with N2.tsv, and fewer still in N6.tsv.
The number of genes shared in HOGs between the two species I was interested in comparing increased from N2.tsv to N0.tsv. However, all those that were shared at N2 were still shared at N0.
In contrast, of the genes that were identified in separate HOGs at N2, some were then present in shared HOGs at N0. However, other shared genes at N0 had not been present at N2.
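For anyone wanting to repeat this comparison, here is a minimal sketch of how I would check which genes are shared between two species at each level. It assumes the N*.tsv files are tab-separated with the columns HOG, OG and Gene Tree Parent Clade followed by one column per species, and that genes within a cell are comma-separated. The species names, gene IDs and the make_table helper are made up for illustration; with real output you would open the N0.tsv and N2.tsv files instead.

```python
import csv
import io

# Columns assumed to hold metadata rather than species gene lists.
META_COLUMNS = {"HOG", "OG", "Gene Tree Parent Clade"}


def read_hogs(handle):
    """Return {hog_id: {species: set_of_genes}} from an N*.tsv-style table."""
    hogs = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        hogs[row["HOG"]] = {
            species: {g.strip() for g in (cell or "").split(",") if g.strip()}
            for species, cell in row.items()
            if species and species not in META_COLUMNS
        }
    return hogs


def shared_genes(hogs, species_a, species_b):
    """Genes of species_a in any HOG that also holds a species_b gene."""
    shared = set()
    for members in hogs.values():
        if members.get(species_a) and members.get(species_b):
            shared |= members[species_a]
    return shared


def make_table(rows):
    """Hypothetical helper: build an in-memory TSV for the toy example."""
    return io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")


# Toy data only, not real OrthoFinder output.
n2 = read_hogs(make_table([
    ["HOG", "OG", "Gene Tree Parent Clade", "SpeciesA", "SpeciesB"],
    ["N2.HOG0000000", "OG0000000", "n3", "a1, a2", "b1"],
    ["N2.HOG0000001", "OG0000001", "n5", "a3", ""],
    ["N2.HOG0000002", "OG0000001", "n6", "", "b2"],
]))
n0 = read_hogs(make_table([
    ["HOG", "OG", "Gene Tree Parent Clade", "SpeciesA", "SpeciesB", "Outgroup"],
    ["N0.HOG0000000", "OG0000000", "n0", "a1, a2", "b1", "o1"],
    ["N0.HOG0000001", "OG0000001", "n1", "a3", "b2", "o2"],
]))

shared_n2 = shared_genes(n2, "SpeciesA", "SpeciesB")
shared_n0 = shared_genes(n0, "SpeciesA", "SpeciesB")

print(sorted(shared_n2))              # ['a1', 'a2']
print(sorted(shared_n0))              # ['a1', 'a2', 'a3']
print(shared_n2 <= shared_n0)         # True: everything shared at N2 is still shared at N0
print(sorted(shared_n0 - shared_n2))  # ['a3']: only becomes shared at N0
```

In the toy data, a3 and b2 sit in separate HOGs at N2 but in one HOG at N0, which is the pattern I saw in my own files.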
Finally, some additional genes (38 for my species of interest) were found to be present in the N0.tsv data, but were not shared with the other species.
Even taking all genes found in N0.tsv, I was still missing some of the genes that OrthoFinder used as input. I found these in N6.tsv, where they did not appear to be shared with any other species I had included in the analysis.
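The check for missing genes can be sketched in the same way. This assumes the gene IDs in the tables match the first word of each FASTA header in the input proteome, which may not hold for every dataset, and the species name and gene IDs below are invented for the example.

```python
import csv
import io


def genes_for_species(handle, species):
    """Collect every gene listed for one species column in an N*.tsv-style table."""
    genes = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        cell = row.get(species) or ""
        genes |= {g.strip() for g in cell.split(",") if g.strip()}
    return genes


def fasta_ids(handle):
    """Return the first word of each FASTA header line (assumed to be the gene ID)."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    }


# Toy inputs only; with real data, open the proteome FASTA and the .tsv files.
proteome = io.StringIO(">a1 desc\nMK\n>a2\nMA\n>a3\nML\n>a4\nMV\n")
n0_table = io.StringIO(
    "HOG\tOG\tGene Tree Parent Clade\tSpeciesA\tSpeciesB\n"
    "N0.HOG0000000\tOG0000000\tn0\ta1, a2\tb1\n"
    "N0.HOG0000001\tOG0000001\tn1\ta3\tb2\n"
)
n6_table = io.StringIO(
    "HOG\tOG\tGene Tree Parent Clade\tSpeciesA\tSpeciesC\n"
    "N6.HOG0000000\tOG0000000\tn7\ta1, a2\tc1\n"
    "N6.HOG0000001\tOG0000002\tn9\ta4\t\n"
)

input_genes = fasta_ids(proteome)
in_n0 = genes_for_species(n0_table, "SpeciesA")
in_n6 = genes_for_species(n6_table, "SpeciesA")

missing_from_n0 = input_genes - in_n0      # input genes with no row in N0.tsv
recovered_in_n6 = missing_from_n0 & in_n6  # of those, the ones that appear at N6

print(sorted(missing_from_n0))  # ['a4']
print(sorted(recovered_in_n6))  # ['a4']
```

Any gene left in missing_from_n0 but not in recovered_in_n6 would need to be looked for in the other node files.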
Therefore, I decided (and please do correct me if I am wrong!) that each output file contains the genes present in orthogroups at that level of the tree. At N0 all species are considered, and the use of the outgroup species at this point may lead to more divergent genes being grouped into the same HOG, where they may have been missed at a lower-level node. Genes particular to a species (not shared with any other species) are output only in the file for the node that the species is directly found in (in this case N6.tsv for my species).
So, to take a very conservative approach and identify all possible orthogroups/orthologues, I would use the highest node on the tree (N0.tsv). However, if I wanted to be more specific and ignore genes that were more divergent, I would use a lower node. To find the genes missing from the top node, I would look lower down the tree to where they appear.
Hope this is helpful, and please do correct me if I'm wrong! I've tried the paper, the video and GitHub, and finally just went back to the files... Jenni