iter
iter copied to clipboard
\iter\unique support
@nikic what do you think about adding unique
method to get iterator of unique values?
Every implementation of that feature could introduce some criticism for memory usage. You can implement a unique filter by yourself, without compromise the purity and efficiency of this lib IMHO
If you are willing to follow this way consider some probabilistic filter such as "Bloom filters". Obviously the implementation depends of how the uniqueness is important for you.
Maybe in some cases distinct would work? http://reactivex.io/documentation/operators/distinct.html (I know this is not related to reactive, many of the concepts from reactive map very well to iterables though)
I think what @toretto460 is driving at here, is that a "unique" iterator is not really able to operate in a truly streaming fashion. There are essentially two ways to implement uniquing: Either based on a hashtable, which is efficient but does not support all key types, or by sorting all elements. The latter requires converting the iterable into an array upfront, at which point the streaming aspect is lost entirely and one might as well use array_unique()
instead. The former only needs to store one key per unique item, so one could still have a memory advantage there if the iterator contains a high number of duplicates.
If unique()
is added to this library, I'd go with a hashmap based implementation and a signature of unique($iterable, callable $keyFunction = null)
, where the latter argument can be used to determine a key that is used for uniquing (which will make this compatible with object types etc.)
@nikic I agree that it cannot stream without keeping state.
The advantage of having this as a stateful streaming implementation is that it still defers execution and thus memory usage until it is actually needed. What if the result iterator is only called 10 times? Then i'm pretty sure we'll be fine. Which would not be the case if I had to do array_unique
on a source array that doesn't fit in memory.
This can actually be solved by just passing an invokable filter object to filter
:
For example this should work (untested code):
class GreedyArrayHashFilter
{
private array $history = [];
public function __invoke($value): bool
{
$serialized = serialize($value);
$hash = hash('sha256', $serialized);
if (isset($this->history[$hash])) {
return false;
}
$this->history[$hash] = true;
return true;
}
}
By implementing it this way you can also use different implementations that could for example:
- Use an SplDoublyLinkedList instead of an array for (possibly) better performance.
- Use a probabilistic data structure like a bloomfilter for operation on large sets.
Note that an implementation of these kinds of objects doesn't necessarily fit into this library.