GeneRax
incomplete event reporting with NHX reconciliated tree format
Hi Benoit,
I am implementing a parser for GeneRax output, specifically for the reconciled trees in NHX format, since that is how the program writes them with the --reconciliation-samples option.
Doing so, I noticed that some transfer events were not reported in the NHX format, while they were in the recPhyloXML equivalent. See example below:
(BRADYR26_WSM3983__1:0.249317[&&NHX:S=BRADYR26:D=N:H=N:B=0.249317],(...))n338:0.028047[&&NHX:S=clade112:D=N:H=N:B=0.028047]
<clade>
  <name>NULL</name>
  <eventsRec>
    <speciation speciesLocation="clade112"/>
  </eventsRec>
  <clade>
    <name>BRADYR26_WSM3983__1</name>
    <eventsRec>
      <branchingOut speciesLocation="clade113"/>
    </eventsRec>
    <clade>
      <name>loss</name>
      <eventsRec>
        <loss speciesLocation="clade113"/>
      </eventsRec>
    </clade>
    <clade>
      <name>BRADYR26_WSM3983__1</name>
      <eventsRec>
        <transferBack destinationSpecies="BRADYR26"/>
        <leaf speciesLocation="BRADYR26"/>
      </eventsRec>
    </clade>
  </clade>
  <clade>
    ...
  </clade>
</clade>
(I omitted the content of a big clade but I will send you the full files by email)
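A minimal Python sketch of this comparison, for anyone who wants to reproduce it: it parses the NHX annotation of the leaf shown above and lists the event tags found under each clade of the recPhyloXML excerpt. The elided clade is left out, the helper names (`parse_nhx_tags`, `iter_events`) are only illustrative, and the code assumes the tags carry no XML namespace, as in the excerpt; a full file that declares one would need namespace-qualified tag names.

```python
import re
import xml.etree.ElementTree as ET

# Leaf annotation copied from the NHX line above.
NHX_LEAF = "BRADYR26_WSM3983__1:0.249317[&&NHX:S=BRADYR26:D=N:H=N:B=0.249317]"

# recPhyloXML excerpt from above, without the elided "..." clade.
RECPHYLOXML = """<clade>
  <name>NULL</name>
  <eventsRec><speciation speciesLocation="clade112"/></eventsRec>
  <clade>
    <name>BRADYR26_WSM3983__1</name>
    <eventsRec><branchingOut speciesLocation="clade113"/></eventsRec>
    <clade>
      <name>loss</name>
      <eventsRec><loss speciesLocation="clade113"/></eventsRec>
    </clade>
    <clade>
      <name>BRADYR26_WSM3983__1</name>
      <eventsRec>
        <transferBack destinationSpecies="BRADYR26"/>
        <leaf speciesLocation="BRADYR26"/>
      </eventsRec>
    </clade>
  </clade>
</clade>"""

NHX_TAG = re.compile(r"\[&&NHX:([^\]]*)\]")
TRANSFER_TAGS = {"branchingOut", "transferBack"}


def parse_nhx_tags(newick):
    """Return one {key: value} dict per [&&NHX:...] comment in the string."""
    return [
        dict(field.split("=", 1) for field in body.split(":"))
        for body in NHX_TAG.findall(newick)
    ]


def iter_events(clade):
    """Yield (name, [event tags]) for a clade and all nested clades."""
    events_rec = clade.find("eventsRec")
    tags = [ev.tag for ev in events_rec] if events_rec is not None else []
    yield clade.findtext("name"), tags
    for child in clade.findall("clade"):
        yield from iter_events(child)


nhx_tags = parse_nhx_tags(NHX_LEAF)
events = list(iter_events(ET.fromstring(RECPHYLOXML)))
with_transfer = [name for name, tags in events if TRANSFER_TAGS & set(tags)]

print(nhx_tags)       # the NHX leaf carries H=N, i.e. no transfer flagged
print(with_transfer)  # clades whose recPhyloXML events include a transfer
```

On this excerpt the NHX leaf reports `H=N`, while the recPhyloXML side lists `branchingOut` and `transferBack` on the same gene lineage.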
I don't know if this is an error or just something that this tree format does not cover. If it is the latter (or in any case), may I suggest generating the reconciliation output in the recPhyloXML format, either instead of or in addition to the NHX format, so as to avoid information loss?
I know that the recPhyloXML format is more verbose than the NHX format (about 6x), but that would still not lead to unreasonable file sizes, even with 1000 samples in a file; in addition, such files would lend themselves to efficient compression given their repetitive nature.
Thanks again for providing this great tool!
All the best,
Florent
NB: this is an issue encountered on output from GeneRax v1.2.0, called with the following command:
generax -r UndatedDTL --max-spr-radius 5 --strategy SPR -s core-genome-based_reference_tree_Brady2019.full_clade_defs.nwk -f GeneRax_recs/generax_prot_pointestimate_7_generax.families -p GeneRax_recs/generax_prot_pointestimate_7 --per-family-rates --reconcile
Hi Florent,
I finally found the time to look at this :-) I understand the issue, and I am not sure that I can fix it...
From what I understand, the NHX format only allows one event per gene node. For instance, one gene branch could go through a sequence of SL (speciation-loss) events, but I don't think I can represent them with NHX (at least not in a way that Notung can display). Instead, I only represent the last event along the gene branch, which has to be an event that either is a leaf or creates two gene children (duplication, HGT, speciation, leaf). The "hidden" events (hidden in the sense that I don't represent them in the NHX output) can be either SL or TL (transfer-loss). DL (duplication-loss) events are never inferred because they are not observable. If you only want to know whether there was an HGT or not, you can indeed check whether you can explain the reconciliation with SL events only...
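A rough Python sketch of that SL-only check for a single speciation node, assuming the species tree is available as a child-to-parent map. The toy species tree and the function names are hypothetical, duplication and transfer nodes would need their own rules, and this is not meant to describe what GeneRax does internally. The idea: at a speciation in species P, the two gene children must end up below two different children of P; the intermediate species on each path are then covered by SL events. Anything else needs at least one hidden transfer.

```python
def path_to_root(species, parent_of):
    """Return [species, parent, grandparent, ..., root]."""
    path = [species]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def child_clade_below(ancestor, species, parent_of):
    """Return the child of `ancestor` on the path down to `species`.

    Returns None if `species` is not a strict descendant of `ancestor`.
    """
    path = path_to_root(species, parent_of)
    if ancestor not in path[1:]:
        return None
    return path[path.index(ancestor) - 1]


def speciation_explained_by_sl_only(node_species, left_species,
                                    right_species, parent_of):
    """True if a speciation at `node_species` can reach both gene
    children's species using speciation-loss events only."""
    left = child_clade_below(node_species, left_species, parent_of)
    right = child_clade_below(node_species, right_species, parent_of)
    return left is not None and right is not None and left != right


# Hypothetical toy species tree: root -> (X, Y), X -> (a, b), Y -> (c, d).
PARENT_OF = {"X": "root", "Y": "root", "a": "X", "b": "X", "c": "Y", "d": "Y"}

# Children in different child clades of the speciation species: SL suffices.
print(speciation_explained_by_sl_only("root", "a", "c", PARENT_OF))
# Both children under X: one lineage must have been transferred.
print(speciation_explained_by_sl_only("root", "a", "b", PARENT_OF))
# Child c is outside the clade of X: a transfer is needed.
print(speciation_explained_by_sl_only("X", "a", "c", PARENT_OF))
```

Applied to every speciation node of an NHX tree (with `S=` giving the species of each gene node), a single failing node is enough to say that the reconciliation involves a transfer that the NHX output does not show.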
I think it's ok to use NHX, but it would be safer/easier to use the RecPhyloXML format, because it does not hide any inferred event. But as you pointed out in your mail, it's very verbose (first, it's XML, and second, it always repeats the species tree...). In principle, I am not against supporting more formats.
I will add a warning about NHX somewhere in the documentation, because I never thought about this issue. Thanks for bringing this up :-)
Benoit