groovy-core icon indicating copy to clipboard operation
groovy-core copied to clipboard

GROOVY-6992: Added support for collecting over sets

Open mperry opened this issue 11 years ago • 35 comments
trafficstars

Pull request for https://jira.codehaus.org/browse/GROOVY-6992.

Collecting over sets without passing in a collection to add to returns a list. We really want it to return a set.

I added support for this to DefaultGroovyMethods for a set using the default set of LinkedHashSet. I noticed this was used as the default. It worked for HashSet too, but I have kept the current default. I initially tried TreeSet, but this class is broken in Java and should always require an ordering operation for elements.

A warning, this may break existing code that relies on collecting over a set returning a list. For this behaviour, toList() can be called either before or after the collect method.

mperry avatar Jul 31 '14 00:07 mperry

An example showing the ramifications of this breaking change:

def nums = [2,4,5,6]
Set a = [*nums]
Set b = [*nums, 'foo', 'bar']

// before
assert [1, 2, 3] as Set == a.collect { (int) (it / 2) }.toSet()
assert ['4 * Integer', '2 * String'] == b.collect { it.class }.groupBy().collect{ k, v -> "${v.size()} * $k.simpleName" }

// after
assert [1, 2, 3] as Set == a.collect { (int) (it / 2) }
assert ['4 * Integer', '2 * String'] == b.toList().collect { it.class }.groupBy().collect{ k, v -> "${v.size()} * $k.simpleName" }

So for a you would no longer need toSet() but for b you need to call toList() before collect. It's not clear to me that the proposed change is obviously better. Groovy doesn't assume the transformation in the collect closure produces something which should have a uniqueness property.

paulk-asert avatar Jul 31 '14 00:07 paulk-asert

Another example:

Set a = [-100, 100, 10]
def avgAbsolute = a*.abs().with{ it.sum() / it.size() } // *. is collect shorthand
// before
assert 70 == avgAbsolute
// after
assert 55 == avgAbsolute

Obviously you can get the alternatives:

// before for 55
a*.abs().toSet().with{...}
// after for 70
a.toList()*.abs().with{...}

paulk-asert avatar Jul 31 '14 01:07 paulk-asert

For me, I found the behaviour of mapping over sets surprising. I thought mapping over functors should return the same type and initially I thought a set would be a functor. For reference, the haskell signature is:

fmap :: (a -> b) -> f a -> f b

The bigger issue is should all collections be transformed into lists by default? I had a look at the JDK to see what this could also apply to and the main interfaces I saw were List, Set and Queue. Would this also apply to the Iterable interface? Hmm, after thought Iterable is probably less relevant because it is accessing a structure sequentially like a list.

Now perhaps a set is not a functor. You can't guarantee that composing two functions together will get the same result as applying them individually:

set.map(f).map(g) != set.map(f.andThen(g))

For more on the design issue of set and functor in Scalaz: http://typelevel.org/blog/2014/06/22/mapping-sets.html http://jwesleysmith.bitbucket.org/set-functor.pdf http://failex.blogspot.jp/2013/06/fake-theorems-for-free.html

mperry avatar Jul 31 '14 01:07 mperry

Paul, I had not seen using '*' to convert to sets, only the spread operator for collect. How does it convert to a set and what is the difference between a and b below (where a is a set and b is a list)

def nums = [2,4,5,6]
Set a = [*nums]
def b = [*nums]

mperry avatar Jul 31 '14 02:07 mperry

Does the same thing apply to collectMany?

mperry avatar Jul 31 '14 04:07 mperry

The list despread operator has nothing to do with Set assignment - it was just a way of removing some dupe in the above example (you could wonder why there is no set despread operator but that's perhaps for another day).

There is special target typing performed for list literals. The type given to the list literal is given special treatment for typed declarations with a Collection type. Think "diamond" operator in Java or method parameter target typing in Java 8. So basically Set a = [1, 2, 3] is the same as def a = [1, 2, 3] as Set or def a = new LinkedHashSet([1, 2, 3]).

paulk-asert avatar Jul 31 '14 04:07 paulk-asert

Yes, the type is retained for methods which return a filter, subset or expansion of some original collection, e.g. things like: grep, findAll, split, take, drop, intersect, plus, minus, but where a transformation is (potentially) performed then a new default collection is returned (generally an ArrayList).

One thing that is being thought of for 2.4+ is improved target type inferencing for DGM methods. So for the methods above (but not collect or collectMany) then we know if you provide a Set, then a set is returned, rather than having to provide a special DGM Set variant, those methods might be annotated (with something like @RetainsType - working suggestion only). That way you'd get some of the type inference I imagine that you are after.

paulk-asert avatar Jul 31 '14 05:07 paulk-asert

We are chasing our tails because we don't have higher-order types. I want the to use the functor typeclass, or alternatively in OO terms, have a co-variant return type that matches the structure we are performing the method on (which means duplicate code compared to typeclasses). My argument is we should be allowed to override the method with a co-variant return type, rather than keep it invariant as Collection.

Why should list be regarded as special that everything should return?

Why should collect and collectMany and list be treated specially for the 2.4 type inference proposal?

Would the target type inferencing for RetainType work statically? I can see this would be ok dynamically, I guess it could work statically by generating methods for each subclass at compile time. I guess the compiler could use that information to do type inference fro each call to the method.

Converting everything back to list makes it harder to write more general libraries if list is treated specially.

mperry avatar Jul 31 '14 05:07 mperry

One idea for 2.4 that I have not thought through is to be able to attach a trait to an existing class, rather than have to write extension methods to add those methods. So if trait T is attached to class C, an object of type C could be passed around as a T.

mperry avatar Jul 31 '14 05:07 mperry

+1 for higher-order kinds. But how to implement nicely? So many questions! :-)

There is no implementation of @RetainType at present - just design discussions and prototyping bits of it. The current early design ideas are the @RetainType would be used only for type inferencing during static compilation. It wouldn't generate a whole family of methods - so no e.g. Set variant would be visible from Java for instance - it would just know about the right type in use when checking for errors and could do appropriate casts in some places if needed at the bytecode level.

I'd have to think a bit more about your trait idea but I am sure the Groovy team is open to useful suggestions. Do you have some concrete examples of how the idea would work?

paulk-asert avatar Jul 31 '14 07:07 paulk-asert

I think the better implementation way for this is to change the return type of Collection.collect(Closure) from List to Collection, and use createSimilarCollection(self) instead of new ArrayList(self.size()). You don't need to add defaultSet(). It is already in createSimilarCollection().

But I think one of the biggest problem is the upper compatibility as Paul said.

yukoba avatar Jul 31 '14 08:07 yukoba

Hi guys. Note that everything that changes the method signature is unlikely to be accepted anytime soon as it is a binary breaking change.

melix avatar Jul 31 '14 08:07 melix

This discussion shows one fundamental problem in those collections based operations we offer, and that is trying to keep the characteristics of the origin, while at the same time not being able to really do that. For example, even if we produce a LinkedHashSet for an LinkedHashSet, as soon as a custom Set class is used, all bets are of. And maybe Groovy started to develop a bit into the wrong direction here

We started of with an List type (I think ArrayList) as basic result of operations. If this is known, then all users have to do is to convert. Looking at Java 8, that result type is Stream. If you feed in a Set, you will later have to convert the Stream back to a Set. Currently this looks to me like a better approach.

Then about functors... I am a bit confused, and surely you know much more about functional programming terms than I do... but I thought a functor is a function, not a type. So in my understanding you want collect to be a functor, not Set.

Going to set.map(f).map(g) == set.map(f.andThen(g))

So for example:

Set set = [1,2,3]
def twoTimes = {it*2}
def intHalf = {it.intdiv(2)}
assert set.collect(intHalf).collect(twoTimes) == set.collect(twoTimes << intHalf)

This assert is already true and for this here it does not matter if the result of Set#collect is a List or a Set. Sure, it is no homomorphism then. I am aware

For me the practical problem is more, that without some kind of lazy evaluated structure you won't get any advantage of this. the left side of the assert will performane worse than the right side.

blackdrag avatar Jul 31 '14 09:07 blackdrag

One option for the upper compatibility problem is to use a new method name map() and implement Iterable.map(Closure) like this.

public static <S,T> Collection<T> map(Iterable<S> self, @ClosureParams(FirstParam.FirstGenericType.class) Closure<T> transform) {
    return collect(self, (Collection<T>) createSimilarCollection(self), transform);
}

And write "We recommend you to use map() instead of collect()" in collect() JavaDoc.

My argument is we should be allowed to override the method with a co-variant return type, rather than keep it invariant as Collection.

In Groovy, if you don't use @TypeChecked or @CompileStatic, Collection is enough.

yukoba avatar Jul 31 '14 11:07 yukoba

You are probably familiar with these references already @paulk-asert, but perhaps others aren't.

The story of how higher order types got into Scala is pretty interesting. Vincent Crement wrote a compiler for Featherweight Generic Java (FGJ) and generalised parameters representing types to type constructors. The paper from his work is from 2008 (http://www.jot.fm/issues/issue_2008_06/article2.pdf).

From the implementation section on p 40: "Shortly after we implemented and made public a prototype compiler for FGJ-omega, type constructor parameterization was independently integrated into the Scala compiler under the name of type constructor polymorphism [MPO07]. The authors of the Scala implementation based their syntax on an unpublished version of the present paper where type constructor parameterization were described in the context of Scala. However they extended our syntax to make type constructor parameterization smoothly interact with pre-existing features of the language. ...Our work can be considered as a theoretical foundation for the subset of Scala that corresponds to FGJ-omega."

Odersky I think was at or around the Swiss university at the time and integrated this into Scala. His paper on the topic is "Generics of a Higher Kind", http://adriaanm.github.io/files/higher.pdf.

Perhaps in another dimension Java would have higher order types before Scala!

The site where this work was done was down for a long time, but I just followed the link tonight and it looks like it is up again (http://lampwww.epfl.ch/~cremet/FGJ-omega/). I have not had a chance to look around the site yet, but the compiler and the Coq proofs should be interesting.

I will have a think more to flesh out the trait idea. At the moment we can add methods to classes, but why not add whole traits/interfaces? I think this would mean Groovy had similar functionality to Haskell typeclasses and Scala implicits.

@yukoba made a comment on the return type that we should use createSimilarCollection instead of defaultSet. Note that all of this is not ideal as it will only create a similar collection for those few classes that we code for. This is not a general solution. A general solution here requires a function that creates the structure i.e. (Unit -> C<A>), where C is a collection, most often known as unit on monad with signature (A -> M<A>).

I have been exploring these ideas on a series of blog posts (http://mperry.github.io/). I am part of the way through writing up a post about implementing monads, but you can see the results already here (https://github.com/mperry/functionalgroovy/tree/master/typeclass/src/main/groovy/com/github/mperry/fg/typeclass). The concrete folder under the previous URL shows concrete functor, applicative and monad instances for lists and options. @blackdrag, you might find it useful to start here on functors. A functor is just any class with a single generic type T<A> that implements the map/collect/fmap method with signature T<B> map(F<A, B>). I showed how to do this in Groovy here (http://mperry.github.io/2014/07/01/groovy-functors.html).

I have then been adding these methods to existing classes through Groovy extension modules. Here is an example for adding the monadic methods to List in ListMonadExtension (https://github.com/mperry/functionalgroovy/tree/master/main/src/main/groovy/com/github/mperry/fg). In essence we create a concrete method for each monad method whose implementation calls the appropriate monadic method. So we add flatMap/liftM2/other monad methods to SetExtension2, but the implementation calls the monad implementation for Set that just defines unit and flatMap. Here is where attaching the trait to an existing class would be useful. If we attached ListMonad<A> to List<A> then List would get all the monad methods: unit, flatMap, sequence, traverse, liftM2, etc. This would allow us incredible flexibility where the original class writer does not need to see all possible abstraction for their class. For example, this would allow Lists and other monads to be used in monad comprehensions that rely on the presence of flatMap and map.

Groovy is moving towards more commitment to type safety. The advantage of all this better abstraction, more reuse, better reasoning about code, fewer defects and faster development time.

This is a pretty long winded comment, so I hope others find it useful.

mperry avatar Jul 31 '14 13:07 mperry

Ok, so I wrote I longer comment here and decided to delete it all again, because there is one point I don't really have a clue about yet. You define a functor as any class that defines T<B> map(F<A, B>). But looking at http://mperry.github.io/2014/07/01/groovy-functors.html you right away start with M<B> map(M<A> m, F<A, B> f);. What is m? Because it smells to me like that m should be actually "this" - in other words the source of elements f will be applied on. But if that is right, then it should actually be abstract class M<A> { def <B> M<B> map(F<A,B> f) }

Of course you cannot define a functor trait like this. And I think that is really what you are after in the end.

blackdrag avatar Jul 31 '14 14:07 blackdrag

return type that we should use createSimilarCollection instead of defaultSet. Note that all of this is not ideal as it will only create a similar collection for those few classes that we code for. This is not a general solution.

I think so too. Therefore, to add immutable collections, I have to do it inside of Groovy https://github.com/groovy/groovy-core/pull/482, and every time Oracle add a Collection class to Java, they have to be added to createSimilarCollection() https://github.com/groovy/groovy-core/pull/478 .

But isn't changing those system is quite hard task and lost compatibility?

yukoba avatar Jul 31 '14 14:07 yukoba

I will try to explain the pseudo Java for the Mappable<M<Z>> interface in the blog. Here M is the generic type parameter which itself takes a single generic type Z. This is invalid Java. Say we can implement the Mappable<List<_>> type, or a Mappable<Set<_>> type or the Mappable<Option<_>> type. They are all structures that can be mapped over. So you want to parameterise the type that can be mapped over so that map takes a parameter of the correct type (M<A>) and returns the correct type (M<B>). This means map for a Mappable<Set<_>> will return Set<B>, but for a Mappable<Option<_>> will return an Option<B> for any A and B used in the map function.

This is not that important for Functor because it has just a single method. Applicative (http://mperry.github.io/2014/07/02/groovy-applicatives.html) extends Functor and has two abstract methods (pure and apply). From these two methods, many other useful concrete methods can be derived which all use the type parameter of the class. The examples I gave for Applicative were liftA3 and sequenceA which both use the type parameter for Applicative.

A good introduction to the topic might be http://learnyouahaskell.com/functors-applicative-functors-and-monoids.

What may be confusing is that my proposal to add a trait to the class has a mismatch on the usual 'this' parameter. I have not thought carefully about this yet and so it is unresolved.

mperry avatar Jul 31 '14 14:07 mperry

@yukoba The problem you are having is that the map method of Functor can't be implemented by itself in a vacuum because it needs to know how to create the new structure. Any monad (with the unit and flatMap methods) can implement Functor's map method like this (https://github.com/mperry/functionalgroovy/blob/master/typeclass/src/main/groovy/com/github/mperry/fg/typeclass/Monad.groovy).

Groovy cheats by knowing about how to implement the unit method (to create the new structure) for some common collections and then defaults to List (I think) for those it does not know about. This works much of the time, but is fundamentally broken. We need specific map methods for each structure that have specific unit methods that create the appropriate similar structure. This is why implementing the general map method in Functor requires a unit implementation from Monad.

On my todo list is to have a close look at what is available in PCollections and compare it to the immutable data structures available in FunctionalJava.

mperry avatar Jul 31 '14 15:07 mperry

Mark introductions are often fine and all, but they do for example not really explain why a functor should be a homomorphism. And what implications it has if not... in other words I am missing the bigger picture.

blackdrag avatar Jul 31 '14 16:07 blackdrag

I will write https://en.wikibooks.org/wiki/Haskell/Understanding_monads/List more simply.

If a collection library implements these two methods, those collection classes can handle in the same way.

1."Haskell return" == "Functional Groovy unit()"

static <T> List<T> createSingleton(T e) {
    [e]
}

2."Haskell binding operator" == "Functional Groovy flatMap()" In precise, Haskell concat() and Groovy flatten() is not same. But please ignore.

/** @param f = { T e -> returns List<T> } */
List<T> collectThenFlatten(Closure f) {
    this.collect(f).flatten()
}

However, the characteristic feature of Groovy is Groovy uses normal Java collection classes. java.util.Collection does not even require a zero arguments constructor. Isn't it impossible if Groovy uses normal Java collection classes?

yukoba avatar Jul 31 '14 17:07 yukoba

One possible solution might be this.

First create a marker interface.

public interface GroovyCollection<T> extends Collection<T> {}

GroovyCollection must implement this static method and must have a public default constructor with no arguments.

public static <T> GroovyCollection<T> createSimilar(GroovyCollection<T> from);

Because asType(Collection col, Class<T> clazz) calls the zero argument constructor, you can write [1, 2, 3] as MyList, if MyList implements GroovyCollection.

Also in createSimilarCollection() add this.

if (orig instanceof GroovyCollection) {
    return invokeStaticMethod(orig.getClass(), "createSimilar", orig);
}

yukoba avatar Jul 31 '14 18:07 yukoba

I made a test implementation of above. See https://github.com/yukoba/groovy-core/commit/45402abdbd88b933df393d65e393cf8261cca9a9 .

Normal, @TypeChecked and @CompileStatic work properly, but IntelliJ IDEA 14 EAP shows errors at @TypeChecked and @CompileStatic.

yukoba avatar Jul 31 '14 19:07 yukoba

Sorry, GroovyCollection.createSimilar() does not need to be a static method. I updated here https://github.com/yukoba/groovy-core/commit/745e42a722747b6585ac252fd0aa1791e89890f2 .

yukoba avatar Aug 01 '14 03:08 yukoba

I think what Mark wants is something like this:

trait CollectionFactory<T> {
    abstract T createSimilar()
}

trait ArrayListFactory<T> implements CollectionFactory<ArrayList<T>> {
    public ArrayList<T> createSimilar() { return new ArrayList<T>() }
}

ArrayList<Integer> foo = ...
def sim = foo.createSimilar() // trait mixed into ArrayList

This cannot be done using an extension module. It also fails compiling, and it's a bug using:

def list = [1,2].withTraits(ArrayListFactory)
list.createSimilar()

Yet, one doesn't need traits to do this. With extension modules, you can already declare those:

class CreateSimilarCollections {
    static <T> ArrayList<T> createSimilar(ArrayList<T> self) { ... }
}

The problem with this approach is that even if the method is available at runtime, there is no marker interface added to the class which will let you know that the method is available. This problem cannot be fixed with the traits approach either: unless Groovy uses class rewriting at load time, the JVM will not let you add methods or interfaces to existing classes.

@yukoba: This approach requires to implement the marker interface, which is probably not what users want. Also calling InvokerHelper.invoke is in general not a good idea, because calls are uncached.

melix avatar Aug 01 '14 07:08 melix

So this problem boils down in the inability to specify generic enough signatures on the typing side and a missing unit morphism for Collections on the practical side. createSimilarCollection cannot be a replacement for unit, because it returns not the exact same container, only a similar one

The Java8 solution for this is the conversion to the Monad Stream and then convert back to whatever is needed. Of course at least the last step is done by the user. And cannot be done by us, since unit is missing or incomplete. For pcollections I can solve the problem of missing unit, but adding a fitting method to the interface everyone is supposed to overwrite.

blackdrag avatar Aug 01 '14 08:08 blackdrag

So,

  • java.util.stream.Stream.of() is "Haskell return"
  • java.util.stream.Stream.flatMap() is "Haskell bind operator"

Therefore, java.util.stream.Stream is a list monad. I see.

yukoba avatar Aug 01 '14 08:08 yukoba

The problem with this approach is that even if the method is available at runtime, there is no marker interface added to the class which will let you know that the method is available.

@melix, is it possible to check the existence of createSimilar() method created by the extension module? If Yes, then we can call it from DefaultGroovyMethodsSupport.createSimilarCollection(). Then marker interface is not needed. It is a duck typing system.

yukoba avatar Aug 01 '14 08:08 yukoba

It is possible to check the existence of the method at runtime. Performance wise, this would however be terrific.

melix avatar Aug 01 '14 08:08 melix

It is possible to check the existence of the method at runtime. Performance wise, this would however by terrific.

The first argument of createSimilarCollection() is Collection, so the checking class is always Collection. And Collection class is usually limited.

If you create two caches

  • Set<Class> groovyCollectionClassSet
  • Set<Class> notGroovyCollectionClassSet

and if you check it before searching createSimilar() method, then I think performance loss is limited.

yukoba avatar Aug 01 '14 08:08 yukoba