guava
guava copied to clipboard
Trie interface(s) and implementation(s)
Original issue created by sberlin on 2007-09-24 at 06:12 PM
This is a contribution of LimeWire's PatriciaTrie, as discussed at: http://groups.google.com/group/google- guice/browse_frm/thread/ffb2a3b3b9e39e79?tvc=1 .
The files can be licensed as necessary (we own the copyright and can change/transfer the license). I'm not sure what license, if any, these would need to be for inclusion.
Original comment posted by kevinb9n on 2007-10-01 at 07:51 PM
thanks!
It would be really, really (really) helpful if we could collect, here, a good variety of use cases for this trie.
Original comment posted by sberlin on 2007-10-01 at 08:10 PM
We use it internally for a few cases. 1) An IP list. We have a KeyAnalyzer that analyzes ip addresses to store them (and/or ranges of them) as compactly as possible. See: https://www.limewire.org/fisheye/browse/~raw,r=1.18/limecvs/core/com/limegroup/gnutel la/filters/IPList.java .
2) Storing a Kademlia DHT's internal structure. This is probably a very specific use-case, although it works very well. :)
Other use cases: 3) A very efficient (both in memory & CPU) dictionary. It could be used for something like a cellphone's phonebook. You start typing and it immediately finds all strings that began with your string. At the same time the structure acts like a Map preventing you from adding the same String twice. It's kind of like a super- SortedMap.
I'm sure there's a ton of other cases, but those are the ones we use (and have plans
to use). The really awesome things about Patricia is the way it does lookups.
Consider the IPList -- IPv4 addresses have a fixed size of 32 bits. So no matter
how many items you have stored in the Trie, it will always do at most 32 bit
comparisons. And it will often do less (because it intelligently traverses the
tree), and it will keep things sorted and tell you all addresses that begin with
X.Y, etc...
Original comment posted by kevinb9n on 2007-10-16 at 03:37 PM
Great stuff, folks. Thanks so much.
With the sheer mountain of work involved in getting our existing stuff polished, I'm afraid it may be some time before I can look at breaking into new territory. But, I do think that this trie and tries in general are a very promising addition, so don't lose hope!
Status: Accepted
Owner: kevinb9n
Original comment posted by kevinb9n on 2007-10-23 at 04:29 AM
Disclaimer: I don't know much about Tries.
I can imagine that many use cases for a Trie break into at least two categories
- The SortedMap/NavigableMap the right API abstraction that the caller wants, but due to the nature of the data, a Trie can be much faster than existing implementations of these interfaces.
- What the caller wants to see doesn't look like a Map at all, but something much simpler that deals only with prefix-matching like (rough guess):
public interface Trie<N, V> { boolean containsMatch(List<? extends N> sequence); V get(List<? extends N> sequence); // doesn't require exact match void put(List<? extends N> sequencePrefix, V value); }
A separate issue: should we consider providing an alternate interface that is tailored to character-based tries, so this alternate API could use CharSequence in place of List<Character>? Then there would be two simple methods Tries.asCharacterTrie() and Tries.forCharacterTrie() to go back and forth between the two.
Original comment posted by sberlin on 2007-10-23 at 05:49 AM
From my experience, #1 is exactly right. A Trie easily doubles as a super-efficient SortedMap/NavigableMap.
You hit the nail on #2 with the CharSequence version. Trie's I've encountered function with a similar interface, but tailored towards the combined sequence instead of the pieces that build up the sequence. That is, the methods would look like:
public interface Trie<K, V> { boolean hasPrefix(K prefix); SortedMap<K, V> getPrefixedBy(K prefix); // returns a view over all entries prefixed by K void put(K key, V value); }
The difference is the Trie is directly acting off a K value, which would be the CharSequence to a Character, or a NumberSequence to a Number. This has the advantage of being able to reuse the interface for arbitrary-length keys or fixed- length keys whose contents don't divide easily into objects. (For example, IP addresses being composed of bits, CharSequences, phone numbers, etc...)
The Trie interface in the submission includes a few additional convenience methods and directly subclasses SortedMap (if it were targetted for 1.6, it would extend NavigableMap too). The additions are for better use with fixed-size keys and being able to visit the entries in the map while traversing (with a 'Cursor').
Original comment posted by sberlin on 2007-10-23 at 05:50 AM
(Really, that getPrefixedBy could just return a List<V> - having it return a vw of the map itself enables some really cool things.)
Original comment posted by kevinb9n on 2007-11-03 at 05:32 PM
(No comment entered for this change.)
Labels: Post-1.0
Original comment posted by kevinb9n on 2008-06-02 at 05:48 PM
(No comment entered for this change.)
Labels: -Post-1.0
Original comment posted by tim.frey.online on 2009-03-22 at 10:31 AM
Hello,
currently I'm wirting something for my studies and did a bit of research about Tries. I fall upon a Hash Table performance like Trie called HAT-Trie. It's a proposal for a fast memory aware Trie that can deal with non equal memory costs. http://crpit.com/confpapers/CRPITV62Askitis.pdf
In my opinion it maybe would be a great advantage to use this Trie.
Hope I could help, Tim
Original comment posted by ray.a.conner on 2009-08-03 at 05:53 PM
To completely generalize it, a Trie is kind of a particular way of representing a Set< List< E > > or, easily enough, a Map< List< E >, V > where E is typically a character (List<E> is a String). The representation allows certain specific operations to be very efficient, and is efficient for storage when there are many common prefixes among the List<E> (a spelling dictionary, for example).
There may be some benefit to completely abstracting it in this way, although translating the common use case of a String key would be an excessive amount of overhead.
Original comment posted by ray.a.conner on 2009-08-03 at 06:05 PM
As a use case, I've used it for finding phrases within a large text document. I have some set of phrases that I'm looking for (not Strings, but sequences of words, hence
- List<String>) stored as a Trie. To search, I walk the normalized text (split on whitespace, punctuation removed, handle case insensitivity, etc.) which is also a List<String>, keeping track of possible matches as I go.
Generalizing, this is finding all possible subsequences of List<E> (the text) that are members of a given Set<List<E>> (the search phrases).
The big benefit is that the text, which can be very large, only needs to be traversed once to find any one (or all) of thousands of possible phrases. It scales very well.
Original comment posted by creswick on 2009-08-03 at 06:31 PM
I have the same use case as Ray.a.conner, and I'd also use it for scalable tab|auto- completion in interactive consoles|text fields.
Original comment posted by sberlin on 2009-08-03 at 06:34 PM
Are you folks using the attached PatriciaTrie implementation, or a separate one?
Original comment posted by jared.l.levy on 2009-08-03 at 06:44 PM
The Google code base includes multiple Java trie implementations. If we chose to add a trie to the library, we'd probably start with one of those.
Labels: -Type-Defect, Type-Enhancement
Original comment posted by ray.a.conner on 2009-08-03 at 08:13 PM
I rolled my own implementation, some years ago. I wasn't storing simple strings, and pretty much everything out there was character-centric. Once I made the abstraction to sequences of strings, sequences of objects was a no-brainer. Although I never needed to implement many of the Trie operations that others find useful; I only needed sub-sequence matching.
Original comment posted by kevinb9n on 2009-09-17 at 06:02 PM
(No comment entered for this change.)
Labels: Milestone-Post1.0
Original comment posted by master.java.xiao on 2009-11-24 at 08:06 AM
(No comment entered for this change.)
Original comment posted by [email protected] on 2010-07-30 at 03:54 AM
(No comment entered for this change.)
Owner: [email protected]
Labels: -Milestone-Post1.0
Original comment posted by [email protected] on 2010-07-30 at 03:56 AM
(No comment entered for this change.)
Labels: -Priority-Medium
Original comment posted by [email protected] on 2011-01-27 at 01:59 PM
(No comment entered for this change.)
Owner: ---
Original comment posted by [email protected] on 2011-02-03 at 06:24 AM
Work is beginning... expect in maybe release 10 or 11.
Status: Started
Labels: Milestone-Release10
Original comment posted by rkapsi on 2011-02-24 at 10:35 PM
Awesome! I don't know if it's of any use for you but our Trie has been under ASL 2.0 for a few years now and has undergone a few refacorings.
<http://code.google.com/p/patricia-trie> <https://github.com/rkapsi/patricia-trie> <https://github.com/rkapsi/simple-patricia-trie>
The Trie interfaces are maybe useful even if you're not going to use PATRICIA.
Cheers, Roger
Original comment posted by jim.andreou on 2011-02-25 at 02:35 AM
Yep, will certainly take this into consideration, thanks for the links. I find the last method (#prefixMap(prefix)) especially nifty, exposing the recursive structure of a trie (though I would expect to see the self type being returned, instead of SortedMap, that's lossy).
That said, I have to admit that personally I'm a bit predisposed against patricia. My reasoning is that for the same kind of performance, we could go with a ternary trie, which is more flexible (e.g. no key analyzer, not necessarily unbalanced trees). But I don't trust what "that kind of performance" means for big tries. I'm still going through the literature hunting for other options (e.g. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499, http://www.naskitis.com/)
Original comment posted by sberlin on 2011-02-25 at 02:55 PM
It probably could return the Trie itself, but would require a bit more work to make sure the additional trie operations stayed with the sub-trie range. Roger & I ate, slept & breathed patricia for a couple weeks while putting this together years ago.. so don't hesitate to ask if you have any questions on it.
Original comment posted by wasserman.louis on 2011-03-07 at 05:41 PM
I'd be interested in working on this -- heh, I've been living and breathing http://hackage.haskell.org/package/TrieMap for some time now.
Original comment posted by [email protected] on 2011-03-23 at 01:48 AM
(No comment entered for this change.)
Labels: -Milestone-Release10
Original comment posted by [email protected] on 2011-07-13 at 06:19 PM
(No comment entered for this change.)
Status: Accepted
Original comment posted by [email protected] on 2011-12-10 at 03:11 PM
(No comment entered for this change.)
Labels: Package-Collect
Original comment posted by ian.g.simpson on 2012-03-10 at 04:48 AM
Any update on this? I just checked out 11.0.2 and I still don't see a Trie implementation in it.
Original comment posted by [email protected] on 2012-03-11 at 11:08 PM
There's been some progress, but unfortunately it's not high-priority. :-( Unlikely for 13.0, maybe 14.0.