Reimplement topological representation
The topological representation should be updated to better represent the phylogenetic forest. Currently we represent phylogenetic DAGs in a newtyped non-empty list to represent a phylogenetic forest. This make it difficult to represent data structure invariant that apply to a phylogenetic forest. Specifically, that each leaf should appear in the phylogenetic forest exactly once. Instead we should make the phylogenetic forest the "primitive" representation.
We should have the leaf nodes places in indices [0..n-1] of the vector with internal nodes occupying indices [n..len-1]. Once allocated, the leaf nodes should never have their indices altered. This will allow for more efficient bit-vector representation in the resolution cache and extracting the leaf set, amongst other things.
This representation, where the phylogenetic forest is the primitive representation, will allow for more natural implementation of breaking a phylogenetic forest "component" on an edge, since the result will necessarily be a forest. It will also allow for more natural calculation of multi rooting cost.
A lot of traversal code will need to be re-implemented during this change. Afterwards, the interface should be stable and highlevel. There might be more opportunities for efficiency gains by using the foldl or monad-memo libraries in the tree traversals to build the updated vector and character matrix when using this representation.
One thing that came up when writing the code to generate tests for candidate network edges (in this issue: https://github.com/amnh/PCG/issues/105) is some of the pain in re-indexing problems. If we had two networks N0 and N1 and wished to form the new network
o
/ \
/ \
N0 N1
Then in the above indexing scheme we would need to re-index the leaf nodes in N1 by the offset of the number of leaf nodes in N0 and then to re-index the internal nodes in N1 by number of total nodes in N0 (or via some other scheme). Given that very often we will wish to take a subtree/subgraph and perform work on it and then re-insert it into a larger graph it is worth thinking about a good way to represent this situation. For instance we could have something like:
ReferenceVector
{ references :: {-# UNPACK #-} !(Vector (IndexData e n))
, offset :: Int
}
[ or even:
data ReferenceVector =
ReferenceVector
{ internalReferences :: {-# UNPACK #-} !(Vector (IndexData e n))
, internalOffset :: Int
, leafReferences :: {-# UNPACK #-} !(Vector (IndexData e n))
, leafOffset :: Int
}
]
Then we would need a custom Indexable instance which returns for instance a leaf node if we index into an offset vector by something in [0...n-1]. I'm not sure if this is quite the correct approach but something like this is worth thinking about.
After a preorder pass, we assign "resolution cache" values to each node, and each resolution cache has a single character sequences in it. We were doing this because PhylogeneticNode requires a resolution cache and we didn't want to create a different node and DAG type for the post order and preorder passes.
However, @Boarders had a great idea of useing a Functor parameter for the DAG. The Functor on the postorder pass can be a NonEmpty and on the preorder pass it can be collapsed to Identity.
We should consider this technique in conjunction with TypeFamilies to simplify the type machinery surrounding the traversals over the graph.
Lets try and pull Bio.Graph.Constructions out of pcg-core:pcg-data-strcutures and into it's own sub-library or, even better, only defined and consumed in app/pcg/.
We should strongly consider representing the leaf nodes and the internal nodes in separate vectors with separate type parameters. This will allow us to express in the type system that the internal nodes may have different data annotations than the leaf nodes. This is often the case.
-
Upon reading data in from files, the internal nodes have no data annotations (type:
()) and the leaf nodes do have data derived from the file inputs. -
After a postorder pass, the internal nodes will contain
ResolutionCachevalues and the leaf nodes, while they could be fitted with singletonResolutionCachevalues, they could also be represented as being just the leaf node decoration with no cache annotations. -
After a preorder pass, the internal nodes no longer have
ResolutionCachevalues. These values are collapsed based on the optimal display tree for each block. The will have "raw" node decorations.
If the internal nodes are represented independently of the leaf nodes, we could use TypeFamily application and a parameterized Functor to represent the variance of the internal nodes. For example, Void for a graph that was just read in, ResolutionCache for the postorder result, and Identity for the preorder result.
Consider leveraging the Ix typeclass and Array the package when interacting with leaf, vertex, (root?) vectors as separate or combined objects.
I believe that we want to have separate internal node data and leaf data values.
This would mean making the graph object a Functor, Applicative, and Monad with respect to the leaf data and it would be a Bifunctor and Biappliacative in the internal nodes.
We should explore BiFoldable and Bitraversable a well.
G m f e n t
-
Gis the type -
mis the embedded monad for monad transformer -
fis the "structural functor" of the graph -
eis the edge datum -
nis the internal node data, this is theBifunctorpameter -
tis the terminal node data, this is theFunctorparameter
We want G to be a monad transformer, with the monad m.
We will use the "structural functor" f to handle the "resolution caches" internally to the graph so that these structures don't pollute the data variables e, n, or t. @Boarders and I discussed Identity for initial graphs and result graphs, and NonEmpty for the post order & pre-order to represent the resolution caches.
e The edge data wouldn't be tied to any core Haskell type-class.
We want n to represent the internal node data. This includes the root nodes, "tree nodes," and "network nodes." It excludes the terminal (leaf) nodes. It should be part of Bifunctor, Biapplicative, Bifoldable, and Bitraversable if at all possible.
We want t to represent the terminal (leaf) node data. It should be part of the standard Functor, Applicative, Monad, and maybe MonadZip or MonadFix.
A rough idea of how the parse result, post-order, pre-order, and final results might look:
parseResult :: G m Identity () LeafDec
postOrderResult :: G m NonEmpty PostDec LeafDec
finalResult :: G m Identity FinalDec FinalLeaf
postorder
:: (a -> g b)
-> (b -> b -> b)
-> G m f c a
-> G m g b a
preorder
:: Comonad f
=> (b -> d)
-> (d -> b -> d)
-> (d -> a -> c)
-> G m f b a
-> G m Identity d c