words_counted
words_counted copied to clipboard
Add critbit support
May I suggest that you also use the critbit gem and add words to critbit? In this way you can also do search by prefix.
Hi there. Thanks for the suggestion. But I'm not sure how this would be useful this would. It also seems to be a heavy dependency to introduce the gem. I'm a bit reluctant to have any JRuby dependencies here too. What do you think?
Hi Mohamad,
Using Critbit would allow the tokenizer to also show all the tokens in sorted order and do searches by token prefix. It would probably be also more space efficient, as Critbit only stores a prefix once as the image below:
[image: Imagem inline 1]
I agree that you should not bring any JRuby dependencies to your Gem. My idea is for you to allow the user to give you the storage class. By default, your code should use a Hash, however, if the user is on JRuby and she has Critbit installed, then when instantiating a WordsCounted, she could pass the Critbit class. Your code has only to consider that instead of working with Hash, it could work with any class that has the same Hash interface.
What do you think?
Cheers,
2015-10-24 19:37 GMT-02:00 Mohamad El-Husseini [email protected]:
Hi there. Thanks for the suggestion. But I'm not sure how this would be useful this would. It also seems to be a heavy dependency to introduce the gem. I'm a bit reluctant to have any JRuby dependencies here too. What do you think?
— Reply to this email directly or view it on GitHub https://github.com/abitdodgy/words_counted/issues/20#issuecomment-150853009 .
Rodrigo Botafogo Integrando TI ao seu negócio 21-3010-4802/11-3010-1802
Your code has only to consider that instead of working with Hash, it could work with any class that has the same Hash interface.
Thanks. I'm having a hard time visualizing what you mean, but that's because I'm not familiar with CritBit. Do you mind sharing a small example of what the interface would look like? Some example, pseudo-code?
I've made a few changes to the gem in the last couple of weeks. Are you aware of those changes? I've decoupled the Tokeniser and the Counter classes.
Hi Mohamad,
The Hash API should be a subset of the Critbit API. In principle, wherever you use a Hash you could use a Critbit. The main difference is that Critbit will keep the data in sorted order while Hash if you do hash.each will cycle through the data in the order it was inserted. Bellow I show the use of Critbit as a Hash:
crit = Critbit.new
add some key, value pairs to crit
crit["hello"] = 0 crit["there"] = 1 crit["essa"] = 10 crit["Essa é uma frase para armazenar"] = 100 assert_equal(4, crit.size) assert_equal(0, crit["hello"]) assert_equal(1, crit["there"]) assert_equal(100, crit["Essa é uma frase para armazenar"])
fetch the key from crit
assert_equal(0, crit.fetch("hello"))
remove a key, value pair from crit. Given the key it will return
the value and
remove the entry
assert_equal(0, crit.delete("hello")) assert_equal(3, crit.size) assert_equal(nil, crit["hello"]) crit.delete("not there") { |k| p "#{k} is not there" }
assert_raise ( KeyError ) { crit.fetch("hello") } assert_equal("NotFound", crit.fetch("hello", "NotFound"))
crit also accepts complex objects
crit["works?"] = [10, 20, 30] assert_equal([10, 20, 30], crit["works?"]) assert_equal(["works?", [10, 20, 30]], crit.assoc("works?"))
check if keys are stored in crit
assert_equal(true, crit.has_key?("there")) assert_equal(false, crit.has_key?("Not there"))
crit stores data in sorted order, so we can call min and max on crit
assert_equal(["Essa é uma frase para armazenar", 100], crit.min) assert_equal(["works?", [10, 20, 30]], crit.max)
crit also allows for checking value containment
assert_equal(true, crit.has_value?(100)) assert_equal(true, crit.has_value?([10, 20, 30])) assert_equal(false, crit.has_value?("hello"))
method entries returns all entries in the Critbit... same as Hash
assert_equal([["Essa \u00E9 uma frase para armazenar", 100], ["essa", 10], ["there", 1], ["works?", [10, 20, 30]]], crit.entries)
it is possible to change a value for a given key
crit["essa"] = 20 assert_equal([["Essa \u00E9 uma frase para armazenar", 100], ["essa", 20], ["there", 1], ["works?", [10, 20, 30]]], crit.entries)
Critbit also allows to get data by prefix. Lets add some data to a Critbit:
crit = Critbit.new
crit is space efficient and stores prefixes only once and can be used to
find only strings that match a certain prefix
items = ["u", "un", "unh", "uni", "unj", "unim", "unin", "unio", "uninc", "unind", "unine", "unindd", "uninde", "unindf", "unindew", "unindex", "unindey", "a", "z"]
add items to the container
items.each do |item| crit[item] = item end
Let´s now retrieve only data that has prefix ‘unin’
crit.prefix = "unin"
Does each for all elements in the critbit with prefix 'unin'
print "[" crit.each do |key, value| print "[#{key}, #{value}] " end print "]"
This is the result:
[[unin, unin] [uninc, uninc] [unind, unind] [unindd, unindd] [uninde, uninde] [unindew, unindew] [unindex, unindex] [unindey, unindey] [unindf, unindf] [unine, unine] ]
Does that help?
2015-10-27 21:35 GMT-02:00 Mohamad El-Husseini [email protected]:
Your code has only to consider that instead of working with Hash, it could work with any class that has the same Hash interface.
Thanks. I'm having a hard time visualizing what you mean, but that's because I'm not familiar with CritBit. Do you mind sharing a small example of what the interface would look like?
I've made a few changes to the gem in the last couple of weeks. Are you aware of those changes? I've decoupled the Tokeniser and the Counter classes.
— Reply to this email directly or view it on GitHub https://github.com/abitdodgy/words_counted/issues/20#issuecomment-151675382 .
Rodrigo Botafogo Integrando TI ao seu negócio 21-3010-4802/11-3010-1802