data.tree icon indicating copy to clipboard operation
data.tree copied to clipboard

Slow conversion between data.tree and Newick format

Open cysouw opened this issue 7 years ago • 4 comments

Is there a way to speed up this conversion, specifically with ToNewick?

library(ape)
data(bird.families)
system.time(bf <- as.Node(bird.families))
system.time(n <- ToNewick(bf))

cysouw avatar Apr 15 '17 13:04 cysouw

That is indeed a bit slow :-( Will look into it at some point, but no promises on when. Very open for pull requests though!

gluc avatar Apr 20 '17 04:04 gluc

Apparently, recursive programming is not ideal inside R :-(, e.g. http://dirk.eddelbuettel.com/blog/2011/09/08/#rcpp_for_recursion

So, there should be some workaround possible ("memoization") but that would sort-of destroy the whole approach of the data.tree package. So, pass everything to C?

cysouw avatar Apr 22 '17 16:04 cysouw

As we are talking a few hundred nodes, recursion is not the problem. Rather, the current DefaultPlotHeight implementation is recursive, and on top it's called recursively by ToNewick. This, of course, is highly sub-optimal.

If you don't need the height in your Newick, the simplest solution is to avoid it, like so:

n <- ToNewick(bf, heightAttribute = NULL)

If you need the plot height, a workaround is to cache the plot height attribute on the nodes, like so:

library(ape)
library(data.tree)
data(bird.families)
bf <- as.Node(bird.families)

SetPlotHeight <- function(node, rootHeight = 100) {
  
  #traverse from leaves towards root to calculate the height and store it in height2
  #Note: cannot call it height, as that already exists on `Node`
  Set(node$leaves, height2 = 1)
  node$Do(function(x) x$height2 <- Aggregate(x, "height2", max) + 1, traversal = "post-order", filterFun = isNotLeaf)
  
  #traverse from root to leaves to calculate the plotHeight
  #(where plotHeight uses the semantics of dendrogram/phylo, where leaves should have height 0 and root height = rootHeight. This meaning is not the same as in the data.tree package)
  node$plotHeight <- rootHeight
  node$Do(function(x) x$plotHeight <- x$parent$plotHeight * (1 - 1 / x$height2), filterFun = isNotRoot)
}

SetPlotHeight(bf)

Then, instead of providing a heightAttribute function, provide the name of the cached attribute:

n <- ToNewick(bf, heightAttribute = "plotHeight")

I need to figure out whether there is a clean way to fix this for good in the DefaultPlotHeight function, thought I doubt it. At least it should be documented in the ToNewick, as.dendrogram.Node, and as.phylo.Node

gluc avatar May 05 '17 20:05 gluc

This is exactly what I was looking for!! This cut down the time to convert a tree with 5971 tips and 1631 internal nodes, from several(!!) minutes to just 2 secs!:

If you don't need the height in your Newick, the simplest solution is to avoid it

Perhaps this is the cause of slowdowns in other conversions like ToDataFrameTable as well? cf. this gist. I'm wondering about the slowness in conversion (of seconds or more) even after pruning to a depth of 2-6. Whereas just almost any other operation like leafCount for example is almost instant for the whole tree.

For context, I'm trying to convert a json file via data.tree to tidytree format to do some visualizations. Any other alternative would also be very helpful.

In the long term, I wonder how hard it would to implement some parts in rcpp? I'm hoping this can be my go to package for hierarchical data. There really is a lack of tools in R for stuff like json. I'd be willing to spend some time if needed to work on this. Since I really need a way to work with and visualize large json structures, any suggestions for alternatives is fine too..

geekyi avatar Mar 02 '20 09:03 geekyi