fastutil icon indicating copy to clipboard operation
fastutil copied to clipboard

Fast in-place compute/merge methods in OpenHashMap classes

Open gregsh opened this issue 5 years ago • 33 comments
trafficstars

Some things I'd like to highlight:

  • null handling is preserved for default Map methods, so no defaultValue support (checked in tests)
  • 0 is treated as null element in Double/Float2xxx maps which seems rather suspicious
  • xxxMap#computeIfAbsent methods with xxxUnaryOperator can be named computeIfAbsentFast to avoid clashing

gregsh avatar Jul 05 '20 22:07 gregsh

Friendly bump. Were my changes of any use?

gregsh avatar Nov 11 '20 16:11 gregsh

One note: I think using ...AsInt as suffix is a bit more in line with JDK convention than ...Fast. For computeIfAbsentFast I'd rather say computeAsIntIfAbsent or computeIfAbsentAsInt (replacing Int with the respective type, of course). But that's a matter of taste. Leading to the next point.

Also, please replace the CONCAT with concrete names. Something like COMPUTE_IF_ABSENT macros etc. (which can be defiend using CONCAT of course). This way, other parts of the implementation can call these methods safely (for example, wiriting this.COMPUTE_IF_ABSENT(foo, bar)). And these names can be easily changed at a single place.

incaseoftrouble avatar Nov 11 '20 19:11 incaseoftrouble

@incaseoftrouble thanks for response

I've extracted CONCAT'd names, and tried to replace Fast suffix with something you've proposed.

  1. For maps with primitive keys (e.g. Int2IntOpenHashMap) everything goes smooth. fastutil already uses names like mergeInt in maps with reference keys, so I switched to just mergeInt, computeIntIfPresent (not mergeAsInt, etc.).
  2. For maps with reference keys (e.g. Object2IntOpenHashMap) we have a problem. They already use computeInt and mergeInt names for methods with java.util.function.BiFunction argument. 2.1 We can't use Int suffix due to lambda-unfriendly overloading. mergeInt(0, 1, (a, b) -> a) won't compile without explicit cast. 2.2 We can't use AsInt suffix because mergeInt and mergeAsInt or mergeIntAsInt methods in the same class look indistinguishable.

Also "fast" methods in maps with primitive and non-primitive keys should have similar names, shouldn't they? I'm not sure how to deal with all that without breaking the current API.

gregsh avatar Nov 12 '20 13:11 gregsh

Ugh. What a headache.

What if instead of providing a new method we check whether the given remapping function (in COMPUTE_IF_PRESENT) is instanceof JDK_PRIMITIVE_BI_FUNCTION? Unfortunately, these methods aren't bulk operations, so the instanceof is called quite often. But I guess this is better than providing yet another specialized method. The difference is that this does not cause boxing / unboxing, so the only overhead is a cast, no objects are being allocated. Also, the instanceof check can actually be delayed until the function needs to be called.

incaseoftrouble avatar Nov 12 '20 13:11 incaseoftrouble

On the second thought, mergeInt and computeInt overloads look not that bad, consider: mergeInt(1, 2, (int a, int b) -> a) or m.mergeInt(2, 2, (IntBinaryOperator)(a, b) -> a).

Source-compatibility would be broken, but binary compatibility would stay intact. Existing code would need explicit types to be compiled, e.g. mergeInt(1, 2, (Integer a, Integer b) -> a).

What do you think?

gregsh avatar Nov 12 '20 15:11 gregsh

Well, source compatibility is important. I'd rather have different names than adding that cast (pragmatically, that's less characters). Just compare: map.mergeIntInt(1, 2, Integer::sum) to map.mergeInt(1, 2, (IntBinaryOperator) Integer::sum). I personally prefer the former.

But its maybe for @vigna to decide :-)

incaseoftrouble avatar Nov 12 '20 16:11 incaseoftrouble

Welcome to naming headaches in fastutil—one of my favorite pastimes in the last 22 years 😂.

So, I examined the code and it's great work. Sorry for the delay in answering. Let's try to work out the naming problems and merge this in.

A good question: why not creating type-specific BiFunctions? From what I understand in that case here the advantage is only in the return type...

vigna avatar Nov 15 '20 22:11 vigna

A good question: why not creating type-specific BiFunctions? From what I understand in that case here the advantage is only in the return type...

Well, if you want all then you'd cubic amount of functions (x, y -> z), maybe we can generate one for a) the "primary" types (int, long, double; for computations it doesn't matter anyway) and b) only where Java doesn't have one already.

incaseoftrouble avatar Nov 16 '20 09:11 incaseoftrouble

After having see @techsy730 clever idea for forEachRemaining(), I think we can try to do the same here. The idea is to have a type-specific BinaryOperator class. Then we implement the version based on the JDK primitive thing, if it exists, and we add an identical method accepting a type-specific BinaryOperator. The trick is that the latter is more specific than both the other options (JDK primitive and not), and so it is automatically chosen by the lambda.

The only difference is a little code duplication—in the case of Iterator, @techsy730 added a stub delegating the method. But in this case is a single delegation for a batch of work. In this case I would simply duplicate the (short) code for the two versions for int, long and double. In the other cases there's no duplication.

vigna avatar Dec 18 '20 18:12 vigna

@vigna Do I understand correctly that you are referring to the following solution?

import java.util.function.*;

class Scratch {
  public static void func(ObjectIntMap<Integer> map) {
    map.mergeInt(10, 20, (a, b) -> a + b);  // ok, IntBinaryOp inferred
    map.mergeInt(10, 20, Integer::sum);  // ok, IntBinaryOp inferred

    // We have preserved source (almost) and binary compatibility but:
    // 1. IntBinaryOp interface and the related method are here forever for binary compatibility

    // 2. the following code will become an error
    map.mergeInt(10, 20, (a, b) -> null);

    // 3. the following code is now also possible 
    map.mergeInt(10, 20, new IntBinaryOp() {
      @Override public int applyAsInt(int left, int right) { return left + right; } // <--- that one is used
      @Override public Integer apply(Integer left, Integer right) { return left - right; } // ignored
    });
  }

  interface ObjectIntMap<K> {
    int mergeInt(K key, int v, BiFunction<? super Integer, ? super Integer, ? extends Integer> func); // JDK boxed
    int mergeInt(K k, int v, IntBinaryOperator func); // JDK primitive

    // more specific version with our helper interface
    int mergeInt(K k, int v, IntBinaryOp func); // duplicate or delegate to JDK primitive implementation
  }

  @FunctionalInterface
  interface IntBinaryOp extends IntBinaryOperator, BiFunction<Integer, Integer, Integer> {
    @Override
    default Integer apply(Integer integer, Integer integer2) {
      return this.applyAsInt(integer.intValue(), integer2.intValue());
    }
  }
}

gregsh avatar Dec 19 '20 00:12 gregsh

Almost. IntBinaryOp has in my mind has an apply(int, int) method, which is the one delegating to applyAsInt() if available. As usual, the method inherited from BinaryOperator delegates to the type-specific one and is deprecated. The ignored method would give a warning as you'd be overriding a deprecated method. Another advantage is that we would have composition for all types.

It is also possible in the future that we do something like @techsy730 work on streams, in which we also adapt JDK primitive operators (e.g., you can use an IntBinaryOperator with character values).

More precisely, a type-specific BinaryOperator interface would have a default implementation of applyAsInt() delegating to the type-specific apply().

vigna avatar Dec 21 '20 18:12 vigna

So, I slightly hijacked this PR in the sense that I implemented type-specific binary operators. The modifications to gencsources.sh are no longer necessary. Could you try to bring this PR in the form we discussed above?

vigna avatar Dec 22 '20 11:12 vigna

I've pushed changes for Int/Long/Double maps as we discussed above (had to fix BinaryOperator.drv a bit). We can use new type-specific operators for all other primitive maps.

gregsh avatar Dec 23 '20 03:12 gregsh

OK, I understand a bit better now the situation. The problem is that this has to be fixed first at the interface level, or open hash maps will have type-specific method no other implementation has.

If I understand correctly, we would add compute() methods only if the type of keys and values is the same (as otherwise bifunctions would be too many) and merge for all types because in that case the bifunction it is just a binary operator. Presently, moreover, the only code is for the JDK special types int/long/double, for which there are functions. We would get some ambiguity when there is an object-based operator and a JDK operator.

vigna avatar Dec 24 '20 09:12 vigna

OK, I understand a bit better now the situation. The problem is that this has to be fixed first at the interface level, or open hash maps will have type-specific method no other implementation has.

Sorry, I was referring to Byte/Char/Short/.. open hash maps, as those types now have their binary operators missing in JDK thanks to BinaryOperator.drv. Or we can use widened operators there for JDK methods. I don't care about them much, so it is for you to decide. I.e.:

  interface Object2CharOpenHashMap<K> {
    int mergeChar(K k, char v, BiFunction<? super Character, ? super Character, ? extends Character> func); // existing
 // int mergeChar(K k, char v, IntBinaryOperator func); // optional JDK primitive for widened type
    int mergeChar(K k, char v, CharacterBinaryOp func); // type-specific operator
  }

If I understand correctly, we would add compute() ...

Yes, compute and computeIfPresent have same signature (K, (K, V)-> V) -> V, so type specific methods with BinaryOperator for (K, V)->V argument is generated only when K and V have equal wide type. The logic I've added is kind of dormant now, as I skip those methods for non-Int/Long/Double maps.

gregsh avatar Dec 24 '20 13:12 gregsh

Hey, we're working for a better world—in the sense of a more readable #ifdef'ing of the code. We'll come back to this PR as soon as things are stabilized (hopefully soon).

vigna avatar Dec 30 '20 15:12 vigna

So, there has been a lot of internal reworking inside fastutil, so we will need to bring this up to date with those reworking. I'm also thinking about including functions K⨉K->V and K⨉V->V, which are relatively limited in number, so we can have type-specific implementation of these functions (compute, merge, etc.) for all types. We're very close to a release, so I would wait for the next one (soon) to include this stuff and possibly solve #101 , which is strictly related.

vigna avatar Jan 14 '21 08:01 vigna

Something wrong with 4be250e92d8ce6eb0efcaa4545046067340aaa9b. MERGE def is deleted and a new method with MERGE name is added in Map.drv. I can't compile my PR because it uses MERGE=merge and I get duplicate merge methods error in Int2ObjectMap and other *Map classes.

gregsh avatar Jan 15 '21 01:01 gregsh

Urgh. I made so many changes—some mistakes slipped through. Thank you for reporting this, I'm looking into it...

vigna avatar Jan 15 '21 11:01 vigna

That should be fixed with 6bacc8990. There were several cases in which the method was compiled in, but it wasn't necessary.

I think I didn't left out any case (or there would be pre-existing methods that do not exist anymore).

vigna avatar Jan 15 '21 11:01 vigna

Compiles OK now. Are there any questions left for this PR?

gregsh avatar Jan 15 '21 15:01 gregsh

Almost there.

  • To bring this in line with the overall restyling, we need to avoid #ifdef. There are several things you can use, KEYS_INT_LONG_DOUBLE, etc., but we have tried to eliminate anything that is not readable immediately. #ifdef SOME_METHOD makes incredibly difficult for the reader to understand when the code will be compiled in.
  • Did you try whether no ambiguity arises? You should be able to write lambdas for all these methods without any type indications.
  • I would like to add functions XY->Y and XX->Y to complete the picture.

So the plan is to release 8.5.0 (which certainly will have bug reports), and while it stabilizes complete this PR. I would like to have those functions in because there might be ambiguity problems we do not realize immediately. Then we can have default methods in Map.drv, and these optimized methods for hash maps.

Adding the functions is a bit tricky, because to preserve your idea, which I think is very good (always supporting the JDK binary operator in some way) we need some conditional compilation similar to that for functions. Like, CharChar2ByteFunction should extend IntBinaryOperator. In this way we can play that "three method game" that we played with all other methods (one method has a type that extends both the other two).

Does that sound sensible?

vigna avatar Jan 15 '21 16:01 vigna

To bring this in line with the overall restyling, we need to avoid #ifdef.

Do you have some specific code lines in mind?

Did you try whether no ambiguity arises?

Sure. Object2IntOpenHashMapTest.java has new tests for that.

I would like to add functions XY->Y and XX->Y to complete the picture.

We can add that later, while already testing the new API...

gregsh avatar Jan 15 '21 16:01 gregsh

Do you have some specific code lines in mind?

Everything that is not explicit. #ifdef COMPUTE, for example. In general, no #ifdef (stuff like KEYS_INT_LONG_DOUBLE has an actual value 0 or 1, so you can #if it).

Sure. Object2IntOpenHashMapTest.java has new tests for that.

I would like to add functions XY->Y and XX->Y to complete the picture.

We can add that later, while already testing the new API...

Mmmh. My only fear is that adding methods we might discover that some change of signature is necessary, and then we cannot rollback. Let me think, I might add the functions now, it's not that much work.

vigna avatar Jan 15 '21 19:01 vigna

To bring this in line with the overall restyling, we need to avoid #ifdef.

Done. Anything else?

gregsh avatar Jan 18 '21 01:01 gregsh

If I understand correcty, we have here two type of methods: more efficient implementations of existing interface methods, and some completely new methods (*_PRIMITIVE) that are not present in the type-specific Map interface. Am I right?

vigna avatar Jan 29 '21 13:01 vigna

If I understand correcty, we have here two type of methods: more efficient implementations of existing interface methods, and some completely new methods (*_PRIMITIVE) that are not present in the type-specific Map interface. Am I right?

You are right.

gregsh avatar Jan 29 '21 13:01 gregsh

I think it should be easy to lift those method to default interface implementations no?

vigna avatar Jan 29 '21 14:01 vigna

The rationale here is that probably the methods are fine and there is no need to include hundreds of functions to make them align with the strategy used elsewhere, but it is not a good idea to have methods that appear just in an implementation as people might not get access to them. From what I see, the four _PRIMITIVE methods's code can be very easily lifted up to Map.drv with small changes, and then we would have an backward-compatible interface extension and an optimized implementation for hash-based maps.

vigna avatar Feb 01 '21 08:02 vigna

Did I miss something?

vigna avatar Feb 18 '21 08:02 vigna