go-set-similarity-search
go-set-similarity-search copied to clipboard
Efficient set similarity search algorithms implemented in Go
Set Similarity Search in Go
This is a mirror implementation of the Python SetSimilaritySearch library in Go, with better performance.
Benchmarks
Run AllPairs
algorithm on 3.5 GHz Intel Core i7,
using similarity function jaccard
and similarity threshold 0.5.
Dataset | Input Sets | Avg. Size | go-set-similarity-search Runtime |
SetSimilaritySearch Runtime |
---|---|---|---|---|
Pokec social network (relationships): from-nodes are set IDs; to-nodes are elements | 1432693 | 27.31 | 1m25s | 10m49s |
LiveJournal: from-nodes are set IDs; to-nodes are elements | 4308452 | 16.01 | 4m11s | 28m51s |
Library Usage
For All-Pairs, it takes an input of a list of sets, and output pairs that meet the similarity threshold.
import (
"fmt"
"go-set-similarity-search"
)
func main() {
// Each raw set must be a slice of unique string tokens.
rawSets := [][]string{
[]string{"a"},
[]string{"a", "b"},
[]string{"a", "b", "c"},
[]string{"a", "b", "c", "d"},
[]string{"a", "b", "c", "d", "e"},
}
// Use frequency order transformation to replace the string tokens
// with integers.
sets, _ := SetSimilaritySearch.FrequencyOrderTransform(rawSets)
// Run all-pairs algorithm, get a channel of pairs.
pairs, _ := SetSimilaritySearch.AllPairs(sets,
/*similarityFunctionName=*/"jaccard",
/*similarityThreshold=*/0.1)
for pair := range pairs {
// X and Y are indexes to the original rawSets and sets slices.
fmt.Println(pair.X, pair.Y, pair.Similarity)
}
}
For Query, it takes an input of a list of sets, and builds a search index that can compute any number of queries. Currently the search index only supports a static collection of sets with no updates.
import (
"fmt"
"go-set-similarity-search"
)
func main() {
// Each raw set must be a slice of unique string tokens.
rawSets := [][]string{
[]string{"a"},
[]string{"a", "b"},
[]string{"a", "b", "c"},
[]string{"a", "b", "c", "d"},
[]string{"a", "b", "c", "d", "e"},
}
// Use frequency order transformation to replace the string tokens
// with integers.
sets, dict := SetSimilaritySearch.FrequencyOrderTransform(rawSets)
// Build a search index.
searchIndex, err := SetSimilaritySearch.NewSearchIndex(sets,
/*similarityFunctionName=*/"jaccard",
/*similarityThreshold=*/0.1)
// Use dictionary to transform a query set.
querySet := dict.Transform([]string{"a", "c", "d"})
// Query the search index.
searchResults := searchIndex.Query(querySet)
for _, result := range searchResults {
// X is the index to the original rawSets and sets slices.
fmt.Println(result.X, result.Similarity)
}
}
Supported similarity functions (more to come):
-
Jaccard: intersection size divided by union size; set
similarityFunctionName="jaccard"
. -
Cosine: intersection size divided by square root of the product of sizes; set
similarityFunctionName="cosine"
. -
Containment: intersection size divided by the size of the first set (or query set); set
similarityFunctionName="containment"
.