hoogle
hoogle copied to clipboard
Use discrimination trees for type search
There is a data structure called a discrimination tree that allows one to compactly store a large number of terms (rose trees) with metavariables and then, given another term with metavariables, efficiently filter the set of contained terms down to a subset that may unify with the given term.
Discrimination trees are, at their core, just tries whose nodes are of type Maybe Node
where Node
is a type with one nullary constructor for each AST (term) constructor. Adding a term to a discrimination tree involves doing a preorder traversal of the term to get [Either Var Node]
and then mapping either (const Nothing) Just
and inserting the result into the underlying trie. Querying the discrimination tree involves a backtracking search. The leaves of the trie should have the set of terms whose preorder traversal is equivalent to that path through the trie.
If you hash-cons the discrimination tree, it can take up less space, which would be desirable for an on-disk representation, but may not be worth the effort. If you want to store a discrimination tree on disk, it would probably be wise to use Word32
offsets instead of pointers (or Word16
if the tree is small enough, which it will often be if someone is running the Hoogle indexer on lots of small packages and then combining the indexes). Discrimination trees are really just about filtering a large set of terms to yield a subset that contains the set that unify with some term, so if you have an implementation of unification that handles Haskell constraints, you can easily handle searching with constraints by running the unification algorithm on each term the discrimination tree outputs and throwing out the ones that fail to unify due to unsatisfied constraints.
This problem in general is called term indexing. There are many data structures that can be used for term indexing, and a discrimination tree is one of the simplest. There are other options, like substitution trees, that may be more efficient. You may want to consult the resources below for more details:
- Substitution Tree Indexing
-
Handbook of Automated Reasoning: Chapter 26: Term Indexing
- This is an extremely useful resource.
- If you need a copy of this, I may be able to help you.
-
Discrimination Nets: Improvement and Extension to Bang Graphs
- Page 28 is the relevant area.
If you implement discrimination trees for Hoogle, please consider putting the result of your efforts as a separate package on Hackage, as there is no package existing for them. I am working on an implementation (permalink) but it may take some time for it to be ready.
That sounds like a good thing to use - I'd welcome help integrating it with Hoogle, and a plug-in library doing the heavy lifting would be awesome.
Another thing to note: to handle rank-n types, which may have forall
binders in them, we must erase all the forall
binders (when indexing and when querying) by lifting them to the top level, since most/all term indexing data structures only work for terms without binders. This will give us some false positives, since querying the discrimination tree for forall s. (Int -> ST s) -> Int
will yield (forall s. Int -> ST s) -> Int
(assuming the latter has been inserted). The false positives can then be eliminated via unification. Since a rank-a
type will never unify with a search query that is rank-b
where a ≠ b
, it may make sense to have two discrimination trees in the index: one for rank-1 terms, and one for rank-2 and up. It may also make sense to have more discrimination trees for different levels of higher-rank terms (one for rank-1, one for rank-2, one for rank-3, etc.); the tradeoff we are making is that the result of a querying will be more accurate but the size of the index may be larger (since splitting one discrimination tree into two will prevent sharing of trie nodes between similar terms).
You really don't need to worry about rank-2 types - Hoogle already eliminates them already. They just don't occur enough in Haskell to care about.
Some implementations of term-indexing data structures I found:
-
Twee.Index
- This is a discrimination tree written in Haskell. Not sure about its performance though....
-
Z3Prover/z3@76eb7b:src/ast/substitution/substitution_tree.cpp
- This is a substitution tree. The fact that it's included in Z3 indicates that the efficiency of this data structure probably lives up to the claims made in its paper!
More potentially relevant papers:
- A content-based mathematical search engine: Whelp
-
Fast Term Indexing with Coded Context Trees
- The bottom of page 2 of this paper has a lucid explanation of substitution trees.
I translated some SML code for discrimination trees written by Frank Pfenning into Haskell here: https://gist.github.com/taktoa/7a4d77ebc3a312dd69bb19199d30863b Feel free to make use of it; I don't currently have the bandwidth to work on this.
I don't have the bandwidth either, but if anyone else is interested, go for it.
I just ran into this when trying to show someone how to find the function to turn [IO ()] -> IO [()]
(aka sequence
with t ~ [], m ~ IO, a ~ ()
) which hoogle was unable to find. If I understand correctly, the use of a discrimination tree would have resulted in a match here.
Also, I just realized that even searching for sequence
's signature directly as t (m a) -> m (t a)
returns no results (https://hoogle.haskell.org/?hoogle=t+%28m+a%29+-%3E+m+%28t+a%29&scope=set%3Astackage) which is very surprising.
I'm not familiar with the Hoogle codebase so wouldn't be sure on how or where to integrate what @taktoa has written so far. I may have some spare time in the coming weeks to look into this, so if you could share some code pointers that would be really helpful :)
Yes, I believe using discrimination tree matching would fix that issue.
Some pointers on my code (I don't know a huge amount about the hoogle codebase either):
-
Term v f
is a rose tree of either arbitrary arity functions named byf
, or variables named byv
. -
DT v f
is a discrimination tree ofTerm v f
s. -
insert
will allow you to insert into a discrimination tree, which will be necessary during indexing. -
getGen
returns all the terms in the discrimination tree that are more general than the given query term;getInst
returns all the terms in the discrimination tree that are more specific (instances of) the given query term. For a basic implementation,getInst
is all you should need, but it could be cool to have an option in the UI for searching for terms more general than the query term. - I think it will vastly improve search quality if you can postfilter the output of the discrimination tree by accounting for typeclass data (e.g.: so that a query for
[Set Int] -> Set [Int]
never returnssequence
, sinceSet
has noMonad
instance), but will be quite a bit more work.
The way Hoogle works is that it has one pass where it reads all the types and saves them in a database. The code for that is at https://github.com/ndmitchell/hoogle/blob/master/src/Output/Types.hs#L44-L52. Given all the right information, you need to serialise it to a file. That information must all be serialised using the Store types, https://github.com/ndmitchell/hoogle/blob/master/src/General/Store.hs, which as basically vectors of bytes we mmap in later.
Later on you get that information back through StoreRead, and a query type, and have to spit out the most likely targets. That code is slightly further down in https://github.com/ndmitchell/hoogle/blob/master/src/Output/Types.hs#L54.
Good luck!