epa-ng

The model parameters under data partitioning

Open pyspider opened this issue 4 years ago • 7 comments

If a phylogenetic tree was inferred by likelihood search under data partitioning, how do I get the model parameters? My reference tree was generated from 10 partitions, and my query sequences come from one of those 10 partitions. I am not sure whether I can also use this command (-f e) to get the model parameters for phylogenetic placement: raxmlHPC-AVX -f e -s $REF_MSA -t $TREE -n file -m GTRGAMMAI

pyspider avatar Oct 21 '19 06:10 pyspider

Unfortunately, partitioning is not currently supported in epa-ng. However, if you are sure that your query sequences are from just one of those partitions, then I strongly advise that, at least for phylogenetic placement, you trim your reference alignment to that partition, bringing it back to a single-partition model. I would also suggest you obtain the model parameters using the new version of RAxML: raxml-ng. See here in the readme for a how-to.

pierrebarbera avatar Oct 21 '19 14:10 pierrebarbera
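
The trimming suggested above could be sketched roughly as follows. This is a minimal illustration, not part of epa-ng or raxml-ng: the helper names (`read_fasta`, `trim_to_partition`, `write_fasta`) are hypothetical, the alignment is assumed to be in FASTA format, and the partition is assumed to be one contiguous column range given as 1-based inclusive coordinates, as in a RAxML partition file.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                # Keep only the first word of the header as the taxon name.
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def trim_to_partition(records, start, end):
    """Keep alignment columns start..end (1-based, inclusive) of every sequence."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return [(name, seq[start - 1:end]) for name, seq in records]


def write_fasta(records, path):
    """Write (name, sequence) tuples as FASTA, one sequence per line."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n{seq}\n")
```

With hypothetical file names and coordinates, usage would look like `write_fasta(trim_to_partition(read_fasta("reference.fasta"), 1201, 1850), "reference_trimmed.fasta")`. The query alignment would need the same column range if it was aligned against the full reference.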

Thanks for your quick reply! If I understand you correctly, I don't need to trim the reference alignment and generate a new reference tree for phylogenetic placement: I can use the original reference alignment (containing 10 partitions) and the corresponding reference tree. But I do need to trim the reference alignment to the length of my query sequences to produce new model parameters, so that the queries are placed correctly. For example, I could first get the model parameters from the trimmed reference alignment (trimmed to the region that overlaps my queries in the papara alignment). Then I could use the non-trimmed (original) reference alignment, the reference tree and the new model parameters to run epa-ng. Is that right?

pyspider avatar Oct 25 '19 04:10 pyspider

  • you should obtain the model parameters AND the tree with corrected branch lengths from raxml-ng --evaluate, which you feed with your tree inferred from the full alignment
  • use the resulting tree, trimmed alignment and model parameters for placement
  • (you can use the trimmed alignment here since epa would ignore the rest anyway, since, for a given query, there are only gaps in the rest of the alignment)

It's fairly important that you place on the tree that has branch lengths adjusted to the single partition, since branch lengths and model parameters go hand in hand.

pierrebarbera avatar Oct 25 '19 13:10 pierrebarbera
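
The three steps above can be sketched as two command lines, built here in Python so they can be inspected before anything is run. This is only a sketch under stated assumptions: the file names are placeholders, the model string `GTR+G` is a placeholder that should be replaced with the model actually intended, and the output names `<prefix>.raxml.bestTree` and `<prefix>.raxml.bestModel` are what raxml-ng is expected to write for `--evaluate`. The helper function names are hypothetical.

```python
import shlex


def evaluate_command(ref_msa, tree, model, prefix):
    """raxml-ng call that re-optimises branch lengths and model parameters
    on a fixed topology, using the trimmed (single-partition) alignment."""
    return [
        "raxml-ng", "--evaluate",
        "--msa", ref_msa,
        "--tree", tree,
        "--model", model,
        "--prefix", prefix,
    ]


def placement_command(ref_msa, query_msa, prefix):
    """epa-ng call that places queries on the re-optimised tree, reusing the
    model file written by the evaluate step (assumed output names)."""
    return [
        "epa-ng",
        "--ref-msa", ref_msa,
        "--tree", f"{prefix}.raxml.bestTree",
        "--query", query_msa,
        "--model", f"{prefix}.raxml.bestModel",
    ]


if __name__ == "__main__":
    # Placeholder inputs: tree inferred from the full alignment,
    # reference and query alignments trimmed to the single partition.
    steps = [
        evaluate_command("reference_trimmed.fasta", "full_tree.newick", "GTR+G", "info"),
        placement_command("reference_trimmed.fasta", "query_trimmed.fasta", "info"),
    ]
    for cmd in steps:
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch free of side effects; each list could be passed to `subprocess.run(cmd, check=True)` once the paths are real.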

Thanks! I found that I would lose about half of the species/tips from the reference alignment if I used the trimmed alignment, because many species/tips don't have the gene that the queries align to. So I should use the full (non-trimmed) matrix for placement, i.e. the resulting tree, the full reference alignment and the model parameters. Since all queries were aligned to the same region by papara, and epa would ignore the gap-only sites in the rest of the alignment, trimmed vs. full matrix shouldn't make much difference. Right?

pyspider avatar Oct 27 '19 00:10 pyspider

You can still do it as I outlined and it won't make a difference, because the taxa that don't have signal where the queries align will essentially be ignored, since they don't have data to compare the query to.

pierrebarbera avatar Oct 28 '19 10:10 pierrebarbera

Thanks for your help! 1) As a reoptimisation test, I used a reference tree (from the full matrix) and a trimmed reference alignment (exclusively the mitochondrial genes that contain my query) to get model parameters. But I encountered the following error:

Likelihood problem in model optimization l1: -inf l2: -6695681.3973653158172965049743652343750000000000 tolerance: 0.0000066956813973653154351581111292102122

Does this mean that the inputs to reoptimisation in RAxML (-f e) need to match entirely between the reference tree and the reference alignment, since the branch lengths of the reference tree were inferred from the corresponding reference alignment? In that case I can't use a trimmed alignment for reoptimisation, right? 2) However, if I obtain the model parameters from the full matrix, will the non-overlapping sites affect the phylogenetic placement of my queries? 3) About the next step (using the model file from the full matrix and the resulting tree): if the reference alignment contains gap-only sequences, can they disturb the likelihood computations? Missing data (gap-only sequences) could be interpreted as probability 1/4 each for A, T, C or G. Or can epa-ng still ignore gap-only sequences and their leaves during placement, so that missing data makes no difference to epa-ng?

pyspider avatar Nov 18 '19 04:11 pyspider

I think I will revise my last statement to: (I think) you can in theory do the placement on the trimmed alignment, at least from the perspective of EPA, but looking back I don't think you should. If really only half of your references have the gene you are trying to place, it would be confusing and pretty nonsensical to include them in the tree. Perhaps you should infer a separate reference tree using just the gene that your queries were targeted for. Also you should use raxml-ng :)

pierrebarbera avatar Nov 18 '19 12:11 pierrebarbera
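
Before deciding whether to infer a separate reference tree, it may help to count how many reference taxa actually carry data in the region the queries align to. The following is a minimal sketch, not an epa-ng feature: the function name `taxa_with_data` is hypothetical, the alignment is assumed to be available as (name, sequence) pairs, and `-`, `?` and `N` are assumed to be the only missing-data characters.

```python
# Characters treated as "no data" (an assumption; adjust for your alignment).
MISSING = set("-?Nn")


def taxa_with_data(records, start, end, min_sites=1):
    """Split taxa by whether they have at least `min_sites` non-missing
    characters in alignment columns start..end (1-based, inclusive).

    Returns (kept_names, dropped_names), both in input order.
    """
    kept, dropped = [], []
    for name, seq in records:
        region = seq[start - 1:end]
        informative = sum(1 for char in region if char not in MISSING)
        if informative >= min_sites:
            kept.append(name)
        else:
            dropped.append(name)
    return kept, dropped
```

The taxa in the first list are the candidates for a gene-specific reference alignment and tree; a large second list is a sign that the full-matrix tree contains many tips the queries cannot be compared against.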