iter icon indicating copy to clipboard operation
iter copied to clipboard

\iter\unique support

Open wiktor-obrebski opened this issue 8 years ago • 5 comments

@nikic what do you think about adding unique method to get iterator of unique values?

wiktor-obrebski avatar Mar 30 '16 14:03 wiktor-obrebski

Every implementation of that feature could introduce some criticism for memory usage. You can implement a unique filter by yourself, without compromise the purity and efficiency of this lib IMHO

If you are willing to follow this way consider some probabilistic filter such as "Bloom filters". Obviously the implementation depends of how the uniqueness is important for you.

toretto460 avatar Jun 25 '16 12:06 toretto460

Maybe in some cases distinct would work? http://reactivex.io/documentation/operators/distinct.html (I know this is not related to reactive, many of the concepts from reactive map very well to iterables though)

SamMousa avatar Mar 29 '17 10:03 SamMousa

I think what @toretto460 is driving at here, is that a "unique" iterator is not really able to operate in a truly streaming fashion. There are essentially two ways to implement uniquing: Either based on a hashtable, which is efficient but does not support all key types, or by sorting all elements. The latter requires converting the iterable into an array upfront, at which point the streaming aspect is lost entirely and one might as well use array_unique() instead. The former only needs to store one key per unique item, so one could still have a memory advantage there if the iterator contains a high number of duplicates.

If unique() is added to this library, I'd go with a hashmap based implementation and a signature of unique($iterable, callable $keyFunction = null), where the latter argument can be used to determine a key that is used for uniquing (which will make this compatible with object types etc.)

nikic avatar Apr 12 '17 16:04 nikic

@nikic I agree that it cannot stream without keeping state.

The advantage of having this as a stateful streaming implementation is that it still defers execution and thus memory usage until it is actually needed. What if the result iterator is only called 10 times? Then i'm pretty sure we'll be fine. Which would not be the case if I had to do array_unique on a source array that doesn't fit in memory.

SamMousa avatar Jul 05 '17 13:07 SamMousa

This can actually be solved by just passing an invokable filter object to filter: For example this should work (untested code):

class GreedyArrayHashFilter
{
    private array $history = [];


    public function __invoke($value): bool
    {
        $serialized = serialize($value);
        $hash = hash('sha256', $serialized);

        if (isset($this->history[$hash])) {
            return false;
        }
        
        $this->history[$hash] = true;
        return true;
    }


}

By implementing it this way you can also use different implementations that could for example:

  • Use an SplDoublyLinkedList instead of an array for (possibly) better performance.
  • Use a probabilistic data structure like a bloomfilter for operation on large sets.

Note that an implementation of these kinds of objects doesn't necessarily fit into this library.

SamMousa avatar Sep 11 '20 11:09 SamMousa