backbone icon indicating copy to clipboard operation
backbone copied to clipboard

Added Binary Search for Sorted Collections

Open joshuabambrick opened this issue 10 years ago • 20 comments

Added search method to collection prototype as per #3169. This is automatically used by where and findWhere when it is possible to do so to provide speedups without user needing to do anything.

For toFind, the user may provide:

  • a comparison function which compares takes one value and returns 1 if the sought value has higher index, -1 if its lower and 0 if the value matches (comparator unused)
  • a value to search for (the value of the model[comparator] sought - requires comparator to be a string)
  • a model to search for (where comparator is a function)

Users may also request that it is the index of the model that is returned or that the model of highest/lowest index which matches the comparison is sought.

When no match is found, undefined is returned, or if returnIndex is true then -1 is returned instead.

One point that may need to be considered is that the behaviour when a model is input as toFind but the comparator is a string is different from that when comparator is a function. Furthermore, the name of this method may also need some more thought to emphasise that it is for binary searches or that it only works on sorted collections so binarySearch or the more user-friendly sortedSearched might be preferable.

joshuabambrick avatar Jun 25 '14 18:06 joshuabambrick

I do think _.sortedIndex should be delegated to for the same cases as sort delegates to sortBy, in order to prevent possible inconsistencies between this and sort (https://github.com/jashkenas/underscore/issues/1768)

megawac avatar Aug 09 '14 21:08 megawac

@megawac Could you possibly elaborate on this a bit more? I think using _.sortedIndex would require considerable code replication which I think is best avoided if possible. I have read through the issue you linked to but I don't quite follow what this would achieve. Cheers.

joshuabambrick avatar Aug 21 '14 20:08 joshuabambrick

@megawac In fact, looking at the underscore source code, there is no obvious, efficient way to use _.sortedIndex - the iterator is not passed the index of the element currently being looked at which makes it difficult to work with the getMin and getMax options.

joshuabambrick avatar Aug 21 '14 21:08 joshuabambrick

getMin and getMax are trivially interchangeable. Just increment by 1 once and then until x[i] !== x[i+1]

This function has too many unnecessary options and params. I'd ditch those options and make a second functions sortedIndexRight or whatever if necessary.

megawac avatar Aug 21 '14 21:08 megawac

getMin and getMax are the only options at present. Whilst your suggestion is probably fair in the usual case, it does increase the complexity of this function from O(lg n) to O(n) - equivalent to a full loop of the collection when these parameters are set - so I wouldn't refer to them as unnecessary.

In any case, I would quite like to hear a bit more about why you think _.sortedIndex should be used.

joshuabambrick avatar Aug 21 '14 21:08 joshuabambrick

  1. You could easily optimize for the duplicate case but I don't think thats necessary.

  2. code redundancy, simplicity and future-compatibility with _.sortBy

megawac avatar Aug 21 '14 22:08 megawac

future-compatibility with _.sortBy

might be worthwhile, although I would quite like to hear why you think this method could ever become incompatible.

I'm not sure that your suggested change would necessarily improve code redundancy or simplicity. As far as I can see, the binary search code would still need to be there for the case of a two parameter comparator and so all we would be doing is adding more code for the other cases, effectively achieving the same thing by two possible means, rather that just one.

Furthermore, I think whether or not optimising for the case where multiple models compare equally need to be considered. I think this is actually quite a common situation - perhaps even occurring in the majority of cases and I suppose it would be slightly disappointing to not optimise efficiency where it is possible, since optimising efficiency where it is possible was the goal of this method. If you have an alternative strategy for optimising the duplicate cases whilst maintaining the use of _.sortedIndex that may be worth considering.

joshuabambrick avatar Aug 21 '14 22:08 joshuabambrick

  1. Read the underscore issue I linked earlier.

  2. this is a sorted index I don't really see the rational of needing a right sorted index at all. The sort position is the left sorted i ndex +1 except for equal item cases as we've noted. Its not worth the code bloat to handle the edge case efficiently in my opinion

  3. I've always been against exposing options args as they usually go poorly documented and lead to behavior that is confusing. On Aug 21, 2014 6:48 PM, "Josh Bambrick" [email protected] wrote:

future-compatibility with _.sortBy

might be worthwhile, although I would quite like to hear why you think this method could ever become incompatible.

I'm not sure that your suggested change would necessarily improve code redundancy or simplicity. As far as I can see, the binary search code would still need to be there for the case of a two parameter comparator and so all we would be doing is adding more code for the other cases, effectively achieving the same thing by two possible means, rather that just one.

Furthermore, I think whether or not optimising for the case where multiple models compare equally need to be considered. I think this is actually quite a common situation - perhaps even occurring in the majority of cases and I suppose it would be slightly disappointing to not optimise efficiency where it is possible, since optimising efficiency where it is possible was the goal of this method. If you have an alternative strategy for optimising the duplicate cases whilst maintaining the use of _.sortedIndex that may be worth considering.

— Reply to this email directly or view it on GitHub https://github.com/jashkenas/backbone/pull/3207#issuecomment-52995559.

megawac avatar Aug 21 '14 23:08 megawac

It would be great to get a few more opinions on this. If others agree with you, I would be happy to implement your ideas. That being said, if this gets included, I will gladly also write up documentation to ensure that the use of this method is clear - it would surely take less time than that we've already spent discussing it.

I'm also not fully sure what you mean by 'options args' or 'right sorted' but I hope I have interpreted them correctly.

An overview of different viewpoints

@megawac

  • the getMin and getMax options are unnecessary and you should assume that the min is wanted (or max should be a different function)
  • it is not necessary to find the min/max efficiently, as these are edge cases - if you even support finding them at all
  • improved comparison is needed (beyond greater/less) in order to maintain consistency with the sorting carried out by Array.prototype.sort - suggesting delegating to _.sortedIndex for the same cases as Backbone.Collection.prototype.sort delegates to _.sortBy.
  • the use of _.sortedIndex would reduce code redundancy and improve simplicity
  • the option of having toFind as a function, to use as the compare function (see code) should be removed

@joshbambrick

  • getMin and getMax are useful and should not be discarded - as evidenced by their use in Backbone.Collection.prototype.where. I have added code to throw an error if they are both set in order to make their use easier and provide feedback. Adding a separate method for searching for the max would clutter the API, require extra code, and likely result in code repetition.
  • optimising the search for min and max is important as it ensures O(lg n) run time for this function in all cases - and search optimisation is the key goal of this method - furthermore, the liklihood of several models comparing as equal is higher than @megawac believes
  • maintaining consistency with Array.prototype.sort is a 'nice-to-have' but not worth the costs (described below), will not be guaranteed by documentation, and unnecessary, such to the extent that it is not currently support by Backbone or underscore
  • the best way to maintain consistency with Array.prototype.sort, would be adding a comparison function to the underscore api, which can be used by _.sortedIndex - the use of this new method would allow fixing Backbone.Collection.prototype.search by changing a single line
  • since the two argument comparator format could not use _.sortedIndex, as such all current code (except possibly the current strategy to find the min/max) would be necessary, in addition, the code to use _.sortedIndex and more separate code to (now inefficiently) search for the min/max
  • I am open to the idea of removing the option of having toFind as a function. However it is worth considering that it would provide greater flexibility and a nice way for users of the comparator as a function to compare on whatever criteria they like, without having to create a model that compares equal to pass. Furthermore, the cost of this is only a single line, and I am happy to ensure that it is documented well.

I think that @megawac's main issue with the code as it stands is the inconsistency with Array.prototype.sort, as he posted about in an underscore issue. If the idea of adding a comparison method, such as that he wrote in the issue, to underscore is implemented, I think we can easily and neatly sort that out. Even if this method isn't part of the public API, making it available to Backbone, possibly via a "mostly internal" function, like iteratee, would simplify things enormously. Alternatively, since the function is so short, I could just copy an paste it into my pull request, at least until it gets added to underscore.

joshuabambrick avatar Aug 30 '14 17:08 joshuabambrick

It might improve the interface to take the second parameter as getMax and if that parameter is not true, assume that the user wants the minimum.

joshuabambrick avatar Aug 31 '14 23:08 joshuabambrick

bumped test underscore to 1.7.0 to include @megawac's new _.comparator method for consistency with Array.prototype.sort and _.sortedIndex also only have getMax as a boolean, and otherwise efficiently fetch the minimum (with no time big-O complexity cost)

joshuabambrick avatar Jan 03 '15 21:01 joshuabambrick

@joshbambrick that's not 1.7 but the master working branch.

megawac avatar Jan 03 '15 23:01 megawac

you're right. I wasn't quite sure what the protocol was in this situation.

joshuabambrick avatar Jan 04 '15 00:01 joshuabambrick

@megawac @joshbambrick where are we at with this? Worth adding to 1.2.0 or holding off a bit more?

akre54 avatar Feb 17 '15 18:02 akre54

@akre54 I feel we should off on this until underscores methods are improved. It could reduce the bloat from this implementation

megawac avatar Feb 17 '15 18:02 megawac

Ok I'll leave it be for now. Thanks.

akre54 avatar Feb 17 '15 18:02 akre54

we could update the pull request we a local copy of _.comparator, and then update it to use that defined in underscore, when it becomes available.

joshuabambrick avatar Apr 17 '15 15:04 joshuabambrick

see most recent pull request for version with local copy of comparator (and also reset underscore.js version - was a mistake last time)

joshuabambrick avatar Apr 17 '15 15:04 joshuabambrick

could consider renaming to binarySearch to be Java-like, or sortedIndexOf since it does a similar thing to some underscore.js methods

joshuabambrick avatar Apr 17 '15 15:04 joshuabambrick

Any updates on the status of this? If the compare slows the search down too much, as for _.comparator, we could put this back to the original version with plain old >/< comparison.

joshuabambrick avatar Aug 12 '15 21:08 joshuabambrick