Elaborate on fuzzy matching and autocomplete?
From the README,
Basically, it's a set of strings that lets you test for membership, do spelling correction (fuzzy matching) and autocomplete, but with higher memory efficiency than a regular trie.
Can you please elaborate on how this can be used for fuzzy matching and autocomplete?
As far as I can tell, you can only test for strict presence of a full string. It's not possible to check if a substring or a fuzzy match exists in the set, is it?
Hmm, reading up on tries, I think I see the general direction... Would I use one of the Traverse(word []rune) funcs to check if something is a substring and find possible completions?
Yeah, I think a tutorial or at least a more elaborate example will help a lot. I tried to use the library and the behavior I get is hard to understand.
I could whip something up later along the lines of a traversal tutorial, but right now, I'm pretty swamped.
You're on the right track though. You can use the code in Traverse() as a starting point. Basically, you start at the root node and follow the rune slice down to the child nodes that match each rune. It depends on what your exact application is, but if you hit a node that doesn't have an outgoing edge matching the next rune, you can simply try each one and then see if the next character matches an edge on the next node, and so on. Exactly how you do this depends on whether you are trying for an insertion, deletion, transposition, or substitution. Look up "edit distances" for more details.
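For instance, here is a rough sketch of a bounded fuzzy search along those lines. It walks the tree carrying one row of the classic Levenshtein table per node and prunes any branch that can no longer come in under the distance budget. FuzzySearch is a made-up name, it only handles insertions, deletions, and substitutions (transpositions would need the Damerau variant), and it assumes the root node is exported as Root; no guarantees:

func FuzzySearch(t *mafsa.MinTree, query string, maxDist int) []string {
	q := []rune(query)
	// Row 0 of the DP table: turning the empty prefix into q[:i] costs i.
	row := make([]int, len(q)+1)
	for i := range row {
		row[i] = i
	}
	var results []string
	for _, ch := range t.Root.OrderedEdges() {
		fuzzy(t.Root.Edges[ch], ch, q, row, []rune{ch}, maxDist, &results)
	}
	return results
}

func fuzzy(node *mafsa.MinTreeNode, ch rune, q []rune, prev []int, word []rune, maxDist int, results *[]string) {
	// Extend the edit-distance table by one row for this character.
	row := make([]int, len(q)+1)
	row[0] = prev[0] + 1
	best := row[0]
	for i := 1; i <= len(q); i++ {
		cost := 1
		if q[i-1] == ch {
			cost = 0 // characters match
		}
		row[i] = row[i-1] + 1 // insertion
		if d := prev[i] + 1; d < row[i] { // deletion
			row[i] = d
		}
		if d := prev[i-1] + cost; d < row[i] { // substitution or match
			row[i] = d
		}
		if row[i] < best {
			best = row[i]
		}
	}
	if node.Final && row[len(q)] <= maxDist {
		*results = append(*results, string(word))
	}
	if best > maxDist {
		return // no extension of this prefix can get back under the budget
	}
	for _, next := range node.OrderedEdges() {
		fuzzy(node.Edges[next], next, q, row, append(word, next), maxDist, results)
	}
}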
We wanted to separate traversal, which is application-specific, from the core data structure, which should only be about building and providing access to its elements. Plain traversal is trivial, and the nodes and edges are exported so you can do what you want. Indexed traversal is a little harder, which is why I think I need to add one more function: something like Traverse or IndexedTraverse, except that it traverses only as far as it is able to match the input and guarantees that it always returns a node (whereas the others may return nil if they can't find the string in the tree). It would also have to return the prefix it was able to match.
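Something like this is what I have in mind (a rough sketch; the name and signature are hypothetical, and it assumes the exported Root field):

func TraversePrefix(t *mafsa.MinTree, word []rune) (*mafsa.MinTreeNode, []rune) {
	node := t.Root
	for i, r := range word {
		child, ok := node.Edges[r]
		if !ok {
			// No matching edge: return the last node reached and the
			// portion of the input we were able to match.
			return node, word[:i]
		}
		node = child
	}
	return node, word // the whole input matched
}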
Does that make sense?
Thanks @mholt. I, for one (and I imagine many other potential users of the library), am ignorant of the technical details. Let me explain what I did and what I got. I loaded a large number of titles (imagine book titles), saved the tree to disk (the file was very, very small), then loaded it back and could not retrieve anything. Then I tried narrowing things down to a very small number of items and it kind of worked. Obviously I don't understand how it is meant to work, and an example would help me here. The idea was to create an autocomplete for the (book) titles.
I'd be interested in seeing your code that you used to save the tree and load it again (you can email it if you'd rather keep it private). If it's a bug, I'd like to fix it. And since how to use the library obviously isn't clear, I'd like to document it better.
One reason I haven't drafted more practical tutorials is that the package is still in development. We're using it internally at SmartyStreets, and we will learn more about the structure and make it more convenient over time. However, it's worth reiterating that the package will probably not provide more than rudimentary traversal facilities, meaning we won't build in functions for specific things like prefix lookups, fuzzy matching, etc.
But until the library matures, if you're willing to experiment with me, I threw together a quick, simple prefix search that you could use for something like autocomplete. This example does not use the minimal perfect hashing feature (the Index property on each node) and comes with no warranties or guarantees:
func PrefixSearch(t *mafsa.MinTree, prefix string, max int) []string {
	var results []string
	prefixRunes := []rune(prefix)
	startNode := t.Traverse(prefixRunes)
	if startNode == nil {
		return results
	}
	if startNode.Final {
		results = append(results, prefix)
	}
	return prefixSearch(results, max, prefixRunes, startNode)
}
func prefixSearch(results []string, max int, word []rune, node *mafsa.MinTreeNode) []string {
	// Using the edges in order gives deterministic output
	keys := node.OrderedEdges()
	for _, char := range keys {
		// Stop once max results have been collected (max < 0 means no limit);
		// checking before appending keeps the cap from being overshot.
		if max >= 0 && len(results) >= max {
			break
		}
		child := node.Edges[char]
		if child.Final {
			results = append(results, string(word)+string(char))
		}
		results = prefixSearch(results, max, append(word, char), child)
	}
	return results
}
The minimal perfect hashing would be useful if you need to associate an entry in your tree with some data in a slice in your application. (This package supports it, but I haven't used it in this example.) That feature needs more documentation as it stands; even a simple explanation of the algorithm described in the paper would help.
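For illustration, here is a hypothetical sketch of that idea. I'm assuming that IndexedTraverse's second return value can serve as a 1-based position among the entries (check the docs and the paper before relying on that), and payloadFor is a made-up name:

// Words live in the tree; payloads live in a parallel slice built in the
// same sorted insertion order. The tree then acts as a minimal perfect
// hash from word to slot, so no map[string]string is needed.
func payloadFor(tree *mafsa.MinTree, payloads []string, word string) (string, bool) {
	node, i := tree.IndexedTraverse([]rune(word))
	if node == nil || !node.Final {
		return "", false
	}
	return payloads[i-1], true // 1-based index is an assumption; verify it
}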
Hi @mholt, thank you very much.
Below is what I originally did (abbreviated). Today I also added your function (from above) and used it like:
r := PrefixSearch(CC_MT, "Ty", 10)
fmt.Println("PrefixSearch: %#v\n", r)
but got an empty []string back.
func main() {
	// READ data from file and put it in XX
	var XX []map[string]string

	CC_BT := mafsa.New()
	for _, m := range XX {
		// At first I tried full titles, which are multiple words separated by spaces:
		//CC_BT.Insert(m["full_title"])
		// Since that failed, I tried word by word, which didn't help either:
		for _, s := range strings.Fields(m["long_description"]) {
			CC_BT.Insert(s)
		}
	}
	CC_BT.Finish()

	err := CC_BT.Save("filename")
	if err != nil {
		log.Fatal("Could not save data to file:", err)
	}

	CC_MT, err := mafsa.Load("filename")
	if err != nil {
		log.Fatal("Could not load data from file:", err)
	}

	// See if a string is a member of the set
	fmt.Println("Does tree contain 'Cholera'?", CC_MT.Contains("Cholera"))
	fmt.Println("Does tree contain 'cholera'?", CC_MT.Contains("cholera"))
	fmt.Println("Does tree contain 'Chol'?", CC_MT.Contains("Chol"))
	fmt.Println("Does tree contain 'Chronic pain syndrome'?", CC_MT.Contains("Chronic pain syndrome"))

	// You can traverse down to a certain node, if it exists
	fmt.Printf("'y' node is at: %p\n", CC_MT.Traverse([]rune("city")))

	// To traverse the tree and get the number of elements inserted
	// before the prefix specified
	node, idx := CC_MT.IndexedTraverse([]rune("head"))
	fmt.Printf("Index number for 'head': %d\n", idx)
	fmt.Printf("node is: %#v\n", node)
}
You know you need to insert the items in lexicographical order, right? Make sure to sort your list before you do the insertions, and always check the error value that gets returned in case insertion fails. Right now you're not checking it, which might explain the missing items in your tree. ;)
Iterating a map probably isn't what you want to do since those are in pseudo-random order; you'll have to make sure to iterate in order.
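In other words, something like this (a minimal sketch reusing the XX variable and field name from your snippet; I'm also skipping duplicates, in case repeated inserts are rejected):

words := []string{}
for _, m := range XX {
	words = append(words, strings.Fields(m["long_description"])...)
}
sort.Strings(words)

bt := mafsa.New()
var prev string
for _, w := range words {
	if w == prev {
		continue // skip duplicates
	}
	prev = w
	if err := bt.Insert(w); err != nil {
		log.Fatalf("inserting %q: %v", w, err) // e.g. out-of-order insertion
	}
}
bt.Finish()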
:) I didn't know that. I didn't sort them. I'll try it and get back to you if I have any problems. Thank you very much @mholt, really appreciate your help.
@mholt ok, I tried it. The behaviour makes more sense now.
But if I add 100 items from the top of my sorted list, things look OK; when I get to around 1,000, the same PrefixSearch that was returning results before no longer returns anything. And Insert does not return any errors either. Is that expected behavior?
If you want I can email you my list (I can put the code here but I don't want to put the titles file here).
Thanks again.
Thanks for the detailed info, it was helpful!
I've played with this package a little so far and found something interesting. I took a 2.5 MB uncompressed JSON file with a lot of strings and compared the marshaled file size when encoded with this package versus with encoding/gob.
using mafsa:
1.3 MB uncompressed
706 KB compressed with gzip
using encoding/gob:
1.9 MB uncompressed
391 KB compressed with gzip
(You can find code here, but it has hardcoded paths so it's not go-gettable.)
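Roughly, the comparison goes like this (a simplified sketch rather than the actual linked code; compareSizes and gzipped are made-up helpers, and the file name is a placeholder):

func compareSizes(words []string, tree *mafsa.BuildTree) {
	// mafsa encoding: save the finished tree, then read the bytes back.
	if err := tree.Save("mafsa.bin"); err != nil {
		log.Fatal(err)
	}
	mafsaBytes, err := os.ReadFile("mafsa.bin")
	if err != nil {
		log.Fatal(err)
	}

	// gob encoding of the plain string slice.
	var gobBuf bytes.Buffer
	if err := gob.NewEncoder(&gobBuf).Encode(words); err != nil {
		log.Fatal(err)
	}

	fmt.Println("mafsa:", len(mafsaBytes), "gzipped:", gzipped(mafsaBytes))
	fmt.Println("gob:  ", gobBuf.Len(), "gzipped:", gzipped(gobBuf.Bytes()))
}

func gzipped(data []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Len()
}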
So uncompressed, mafsa is smaller, but when you apply gzip compression, it does take up more space than a simple encoding/gob encoding.
I think that makes sense, because mafsa encodes more information about the data (which gives you the ability to perform fast membership queries and, with some custom code, fuzzy matching or autocompletion). Is my understanding accurate?
@alimoeeny If you could email both the code (or a link to it) and list to me, that would be great -- I could take a look. Thanks for trying it out! I want to make it more usable so this will be helpful.
@shurcooL That is interesting! Off the top of my head, I don't know enough about gzip to say why that would be the case, but perhaps it's something worth looking into further. I also need to read more about gob.
@shurcooL An initial guess as to why mafsa encodes smaller uncompressed is that it is already compressed once it is built; many duplicate elements are consolidated. It's quite likely that a slice of strings still has plenty of duplication among its elements, so its encoding is larger.
But the gzip compression, I dunno... maybe gzip is just better at compressing data that hasn't already been compressed. I need to study how gzip works more, although I doubt that will change how the mafsa structure is encoded.