marisa-trie

Allow to use arbitrary sequences as elements, not only strings

Open dragoon opened this issue 10 years ago • 4 comments

I tried to construct the following trie:

trie = marisa_trie.Trie([('New', 'York'), ('New', 'Castle')])

This raised AttributeError: 'tuple' object has no attribute 'encode'. So I suppose the library accepts only strings, but sometimes you want other structures.

dragoon, Jul 02 '14 14:07

Have you tried using the RecordTrie instead? (same module)

derpston, Jul 02 '14 14:07

I don't really understand this structure: it has keys and values, while I have only values.

dragoon, Jul 02 '14 15:07

Ah, I see what you mean now, disregard my earlier comment. Yeah, as far as I'm aware it only accepts unicode strings.

derpston, Jul 02 '14 15:07

@dragoon I'm not sure adding support for having any object as a key is a good idea - because I don't know how to implement it efficiently.

We can't store just the id of an object (that would defeat the purpose of marisa-trie), so we would have to serialize the key to bytes somehow in order to use it as a key. For strings, the wrapper encodes Unicode input to UTF-8.

To support arbitrary objects we could use pickle, but I'm not sure how compressible the result is, and better task-specific serialization methods usually exist. For example, in your case (a tuple of 2 strings) it makes sense to join the strings with some separator before adding them to the trie and to split on that separator when retrieving. You don't need marisa-trie support to do this.
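A minimal sketch of the join/split approach described above. The separator choice (`"\x1f"`, the ASCII unit separator) and the helper names `join_key` and `split_key` are assumptions made for illustration and are not part of marisa-trie. The trie part runs only if `marisa_trie` is installed.

```python
# Assumed separator: any character that never occurs in the tuple elements works.
SEP = "\x1f"


def join_key(parts):
    """Join a tuple of strings into a single string key (hypothetical helper)."""
    for part in parts:
        if SEP in part:
            raise ValueError("element contains the separator: %r" % (part,))
    return SEP.join(parts)


def split_key(key):
    """Split a joined key back into a tuple of strings (hypothetical helper)."""
    return tuple(key.split(SEP))


pairs = [("New", "York"), ("New", "Castle")]
keys = [join_key(pair) for pair in pairs]

try:
    import marisa_trie
except ImportError:  # the helpers above do not depend on the library
    marisa_trie = None

if marisa_trie is not None:
    trie = marisa_trie.Trie(keys)
    # Prefix search: every stored tuple whose first element is "New".
    new_pairs = sorted(split_key(k) for k in trie.keys("New" + SEP))
    print(new_pairs)
```

Appending the separator to the prefix keeps the search from also matching tuples whose first element merely starts with "New", such as ("Newark", ...).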

But it's true that there are some edge cases (a separator inside a tuple element?), that splitting/joining tuples could be more efficient if implemented in Cython, and that storing tuples of strings is quite common. So I think adding a trie subclass that allows tuples of strings as keys is a good idea; n-gram storage is a common use case. Pull requests are welcome :)
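One possible way to handle the separator-inside-an-element edge case is to escape it before joining. The sketch below is pure Python and only illustrates the idea; the escape character, the separator, and the function names `encode_tuple` and `decode_tuple` are all assumptions, not an existing marisa-trie API.

```python
SEP = "\x1f"  # assumed separator between tuple elements
ESC = "\x1b"  # assumed escape character


def encode_tuple(parts):
    """Escape ESC and SEP inside each element, then join with SEP."""
    escaped = [p.replace(ESC, ESC + ESC).replace(SEP, ESC + SEP) for p in parts]
    return SEP.join(escaped)


def decode_tuple(key):
    """Invert encode_tuple by scanning the key one character at a time."""
    parts, current = [], []
    chars = iter(key)
    for ch in chars:
        if ch == ESC:
            # The character after ESC is taken literally.
            current.append(next(chars, ""))
        elif ch == SEP:
            # An unescaped separator ends the current element.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return tuple(parts)


ngram = ("New" + SEP + "ish", "York")
assert decode_tuple(encode_tuple(ngram)) == ngram
```

The encoded strings can be stored in a regular Trie. An unescaped separator still only appears between elements, so a prefix search on `encode_tuple(("New",)) + SEP` would match only tuples whose first element is exactly "New".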

kmike, Jul 02 '14 15:07