Experiment with Haskell-side, hashable memoized function(s)
Experiment with ST, TVars, and HashTable libraries to implement a Haskell-side memoization library for Hashable hashable structures. See if this is more efficient than the FFI binding to C++.
We want a type signature that looks like this:
(Eq a, Hashable a) => (a -> b) -> a -> b
See this commit 9271b8ed47fffcfe663058811d71d6b8e50344d6 where I removed the old, unsafe IO version.
There is a branch haskell-function-memoize in which I added an initial HashTable, ST, and TVar implementation. It currently doesn't compile because of a type error involving the s state type in the ST monad. I'm hoping @Boarders can help me look at this and see if it can be rectified.
I switched to an IO based implementation. It kinda seems to work... in a finicky way. I'll try shoving this in our codebase and measure the performance against the FFI-based memoized TCM.
The experiment was inconclusive. There is a fold over the bit vectors (DynamicCharacterElement) to perform the pairwise or threeway Sankoff algorithm for the cost and median that takes up all the time when using the native Haskell HashTable embedded in IO and a TVar. Optimizing the fold over the bit vector to allocate far less memory will be required for this to be viable.
We are de-prioritizing this fold optimization for the time being. This should be revisited at a later date. See the overlap, overlap2, & overlap3 functions in 971d8cf74e584ebd80ab9084f84074e09423b4f0.
After completing #141, we can use the benchmarks to quantify if this Haskell memoized TCM is better for usage with string-alignment median lookups.
We have moderate confidence that the memoize function is working as intended and is thread safe.
However, the memoized function is not being retained and reapplied between string alignment functions. The consequence is that the memoization work is discarded between each string alignment.
There may be a technique to remedy this. The hypothesis is that because we define the pairwise "overlap" function though a Getter that special cases a MetricRepresentation value, it re-creates the memoized function on each call to the getter even though the input and output are the same.
We are considering embedding the memoized function into the MetricRepresentation like so:
data DynamicCharacterMetadataDec c
= DynamicCharacterMetadataDec
{ optimalTraversalFoci :: !(Maybe TraversalFoci)
, structuralRepresentationTCM :: !(Either
(DenseTransitionCostMatrix, MetricRepresentation ())
(MetricRepresentation (c -> c -> (c,Word))) -- ^ Changes here
)
, metadata :: {-# UNPACK #-} !DiscreteCharacterMetadataDec
} deriving (Generic, NFData)
Because this would involve having a function in the "metadata sequence," we could no longer utilize compact regions. This is a potentially large downside.
The upside is that, if successful, we would not require a C++ FFI binding, could simplify the Exportable type-class, and further modularize the sub-library build architecture.
This hypothesis would constitute a fairly decent effort to test.
We should also look into using concurrent-hashtable.
I have added the memoized functions to the metadata and excised the usage of compact region from the code. However, this has broken the SAVE and LOAD commands, due to these using compact regions to generate the save states.
We are switching to Binary instances from the binary package to substitute the functionality. Unfortunately this requires writing a lot of Binary instances for almost every data-type in our codebase (either explicitly or through a deriving mechanism). It constitutes to a fair amount of mechanical, straightforward work.