fastutil icon indicating copy to clipboard operation
fastutil copied to clipboard

Add SortedArrayMaps?

Open pburka opened this issue 7 years ago • 13 comments

I have an application where we need many sorted maps of longs to Objects, and we wish to use a very space efficient implementation. We use a custom map backed by a sorted long[] and a corresponding Object[]. This is very similar to fastutil's Long2ObjectArrayMap, except that it provides a NavigableMap interface. Entries are found using a simple binary search. In our expected use case, keys are added to the map in order, so we're not worried about the performance of adding or removing arbitrary keys.

Would it make sense to add this functionality to fastutil? e.g. Long2ObjectSortedArrayMap

pburka avatar Jun 21 '17 14:06 pburka

That'd be nice. Wanna try a pull request? First we should fastutilize NavigableMap/Set tho (sigh).

vigna avatar Jun 21 '17 14:06 vigna

Agreed. Without the NavigableMap interface it's not of much use to me. I'll see if I can make some progress on that.

pburka avatar Jun 21 '17 14:06 pburka

We've got an implementation of this over in LensKit too, with few nice features:

  • The long[] storage is managed by a general KeyIndex data structure that allows code reuse between sets and maps.
  • The key index abstracts over long[] and int[] maps, so that if all keys of a map fit in integers, they are used to save space.
  • Long2DoubleSortedArrayMaps are immutable.
  • They expose their underlying sorted-array-ness to enable high-speed dot products and similar operations.

We plan to soon relicense LensKit under the MIT license (just waiting for the last paperwork to clear), at which point our source code will be reusable in fastutil if desired.

mdekstrand avatar Jun 22 '17 08:06 mdekstrand

I started working on this today by prototyping a NavigableSet template for fastutil. I ran into one complication: a number of NavigableSet methods can return null to indicate absence (e.g. ceiling(e) returns null if there is no element greater than or equal to e).

I believe the full set of affected methods is:

  • ceiling
  • floor
  • higher
  • lower
  • pollFirst
  • pollLast

In Maps, you deal with this by adding a defaultReturnValue(). We could do the same thing in NavigableSet, but (a) it means adding additional state to each implementor, and (b) I'm not convinced that it's the best solution.

Since functions like floor() and ceiling() deal with inequalities, a user may, perhaps, prefer to receive MAX_VALUE as the 'absent' signal for one, and MIN_VALUE for the other.

How would you feel about a fastutil NavigableSet which supplemented Long ceiling(Long e) with long ceiling(long e, long ifNone), etc?

/peter

On Wed, Jun 21, 2017 at 10:55 AM Sebastiano Vigna [email protected] wrote:

That'd be nice. Wanna try a pull request? First we should fastutilize NavigableMap/Set tho (sigh).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vigna/fastutil/issues/73#issuecomment-310104899, or mute the thread https://github.com/notifications/unsubscribe-auth/ADxoy7_o03NnZdkaQwQfv8KYkmVTPh4jks5sGS7egaJpZM4OBCOq .

pburka avatar Jun 24 '17 21:06 pburka

I fully understand your problem. A default return value is sensible for a non-sorted operation such as contains(), but it's utterly counterintuitive when order is involved. I think your proposal is reasonable, but let's season it for a couple of days. I'll think about it.

vigna avatar Jun 25 '17 18:06 vigna

I've prototyped NavigableSet, NavigableMap, and some helpers in NavigableSets in a fork of the repository at https://github.com/pburka/fastutil/tree/master/ https://github.com/pburka/fastutil/tree/master/drv

I chose to append 'OrDefault' to each of the APIs which could return null, as this seemed consistent with the new Java 8 API, Map.getOrDefault().

Take a look and tell me what you think.

/peter

On Sun, Jun 25, 2017 at 2:36 PM Sebastiano Vigna [email protected] wrote:

I fully understand your problem. A default return value is sensible for a non-sorted operation such as contains(), but it's utterly counterintuitive when order is involved. I think your proposal is reasonable, but let's season it for a couple of days. I'll think about it.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vigna/fastutil/issues/73#issuecomment-310919802, or mute the thread https://github.com/notifications/unsubscribe-auth/ADxoy1xqxLBT-ezkAKryfMvE5fcBXYSDks5sHqi5gaJpZM4OBCOq .

pburka avatar Jun 29 '17 00:06 pburka

On 29 Jun 2017, at 02:24, Peter Burka [email protected] wrote:

Take a look and tell me what you think.

I'm a little behind schedule with this, but I'm working on it :).

Ciao,

				seba

vigna avatar Jul 06 '17 06:07 vigna

What about NoSuchElementException? This would nicely apply to all the relevant methods. For pollFirst and pollLast one can easily check whether the exception would be thrown, for the other methods (ceiling, etc.) something like hasLower and hasHigher would be necessary. Those can be defaulted to .tailMap(e).isEmpty().

This approach somehow conflicts with the "default return value" idea, but I am unsure which of the two is more desirable for the typical use cases. To be honest I'd even like to have this "no such element"-behavior for normal maps since it makes debugging a lot easier in some cases. With the advent of getOrDefault I'd even go as far as deprecating default return values completely, since the fastutil-get essentially only is a m.getOrDefault(k, m.defaultReturnValue()), but that is a separate issue.

How would you feel about a fastutil NavigableSet which supplemented Long ceiling(Long e) with long ceiling(long e, long ifNone), etc?

I don't think that this would be used too frequently, but I am not entirely sure about that. If you want to check if the returned value is contained in the set you still have to call something like contains(long e) (except there is some domain knowledge).

One more idea would be to borrow from Optional and have something like .applyCeiling(long e, LongConsumer consumer) but I fear this might be more heavy-weight than just checking with contains(long e)

incaseoftrouble avatar Aug 21 '17 16:08 incaseoftrouble

I'm not sold on throwing NoSuchElementException. First, it's inconsistent with the object versions, which return null. Secondly, it's inefficient. Either you need to probe the map twice (once to check for presence, and again to retrieve the element), or you need to handle an exception, which is usually inefficient.

I think it's particularly awkward to throw exceptions for pollFirst() and pollLast(), since those methods exist specifically to avoid the exceptions of first() and last().

I suspect that users do have domain knowledge in most cases, allowing them to choose values which are either known to be out-of-set, or which are appropriate for the none case. For example, it may make sense to call floorOrDefault(key, Long.MIN_VALUE).

I do agree that deprecating the universal default return values might be a good idea. I think that getOrDefault is easier to reason about than a data structure which carries around its own default value.

pburka avatar Aug 22 '17 00:08 pburka

First, it's inconsistent with the object versions, which return null.

Well, depending on how you look at it. ceiling(Long).longValue() also throws an exception when there is no such value. Similarly, the default return value mechanism is inconsistent with the object-versions. It's more about coming up with the most "intuitive" inconsistency.

Secondly, it's inefficient. Either you need to probe the map twice (once to check for presence, and again to retrieve the element), or you need to handle an exception, which is usually inefficient.

I agree on this, exceptions definitely can't be used as control flow (there was some nice article on that somewhere providing some numbers). I assumed a different kind of "domain knowledge" - one may know that, e.g., the ceiling of some value always is present. At least in my case (dealing with normal maps) I always know which keys are present, so a fail-fast get would be nicer.

I propose the following idea: We could just adapt both approaches, as in we have a ceilingChecked(K e) (or a similar name) which throws a NoSuchElementException if there is no ceiling of e and a ceilingOrDefault(K e, K default) which returns default in that case. One could even add a default ceiling(K e) which forwards to ceiling(e, MAX), where MAX is the particular domain's maximal value (which luckily exists for all primitives ...). For floats and doubles using NaN might be an option, too, but that would get awkward I think.

I think it's particularly awkward to throw exceptions for pollFirst() and pollLast(), since those methods exist specifically to avoid the exceptions of first() and last().

Agree - I'm not familiar at all with NavigableSets so I don't have a "feeling" for the use cases. In this case, it might be sensible to just not add a primitive version of this method, since it always will be awkward: Either you know that there is a first element, then you can call the primitive first() or you don't know and calling the pollFirst() simply is the fastest way to get both the information whether the first exists and, if so, which value it is (in a sense, one can think of the returned Long as OptionalLong)

incaseoftrouble avatar Aug 22 '17 06:08 incaseoftrouble

Design and implementation could probably draw a lot from Guava's ImmutableSortedMap (see RegularImmutableSortedSet), which is implemented using a simple sorted array.

leventov avatar Feb 25 '19 15:02 leventov

But the problem here is mutability. The original post says, indeed, that they are loading the map in order. All strategies I can think for a sorted map based on arrays will have either linear search or linear update time (assuming you won't have space for additional data on top of the array).

vigna avatar Feb 25 '19 19:02 vigna

I stumbled upon this issue because I have a case where I want to sort a "regular" fastutil's ArrayMap. It would be nice if fastutil's ArrayMaps had methods like sortByKeys() and sortByValues(). Then, there can be an adapter class or method "viewSortedArrayMapAsNavigableMap()" that returns an unmodifiable NavigableMap. If the underlying ArrayMap is updated (e. g. a new entry is added), the view is no longer correct, but that's a responsibility of the user.

leventov avatar Feb 25 '19 20:02 leventov