itertools icon indicating copy to clipboard operation
itertools copied to clipboard

Post-hoc addition of new iterators to KMergeBy

Open henselman-petrusek opened this issue 4 years ago • 2 comments

I'm working to make two variants of the (really nice) merging system -- one that allows you to add more iterators to KMergeBy after you've popped some entries out, and one that does essentially the same job, just with a slightly different internal structure (heapifies pointers to a vec of iterators, rather than the iterators themselves).

We weren't sure that either of these would be of interest to the itertools developers (while the former is extremely useful it's also a bit niche, and we haven't seen public methods for updating any of the other iterators in the library). So we've just copied the kmerge_impl file and hacked it from there.

Two questions for the developers: (1) would either of these be of interest for the library, and (2) we noticed that kmerge_impl uses customized itertools macros to implement clone and debug, but we're rust novices and copying macros from a private module seems a little like playing with fire; would there be any harm in replacing these with the standard macros for deriving clone and debug? Thank you!

henselman-petrusek avatar May 17 '21 12:05 henselman-petrusek

would there be any harm in replacing these with the standard macros for deriving clone and debug

If you use derive, it might result in trait bounds too tight (iirc e.g. when using PhantomType internally). As an example, debug_fmt_fields here explicitly does not include less_than, as closures usually do not implement Debug.

However, if you want to use derive in your project, I'd just do it and wait until problems actually pop up. In a library like this, however, we try to remove these obstacles for the users.

would either of these be of interest for the library?

In principle, we are open and willing to incorporate new ideas - but it may take time, and we try to come up with generic and hopefully reusable components.

one that allows you to add more iterators to KMergeBy after you've popped some entries out

I may be misunderstanding this, but is that really specific to KMergeBy?

heapifies pointers to a vec of iterators, rather than the iterators themselves

Can you elaborate on this? Does it essentially heapify iter.by_ref instead of iter?

I hope I do not have a dumb moment and miss the ovious, but it may be easier if you added example use-cases, so that we see what your feature actually allows to do that is impossible with existing infrastructure.

phimuemue avatar May 17 '21 19:05 phimuemue

Thanks very much!

If you use derive, it might result in trait bounds too tight (iirc e.g. when using PhantomType internally). As an example, debug_fmt_fields here explicitly does not include less_than, as closures usually do not implement Debug.

However, if you want to use derive in your project, I'd just do it and wait until problems actually pop up. In a library like this, however, we try to remove these obstacles for the users.

Great! We were only concerned with safety, so the suggestion sounds good.

I may be misunderstanding this, but is that really specific to KMergeBy?

The only unique aspect (from my perspective) is implementation. One can chain many of the operations in the itertools library without much loss of efficiency, however kmerge( [I, J, K, L] ) could be significantly more efficient than merge( merge(I, J), merge(K, L)), since the latter loses out on some of the benefits of an underlying heap data structure.

Can you elaborate on this? Does it essentially heapify iter.by_ref instead of iter?

Yes, that's the essential idea -- one would just add a vector of mutable references to the KMergeBy struct (or first place the iterators into a collection and then pass mutable references to kmergeby). This would only make a significant difference if the iterators were little unwieldy to move themselves. This is true in our use case, since each of our iterators has to store several different types of data, but I don't know how many others would be in a similar situation.

I hope I do not have a dumb moment and miss the ovious, but it may be easier if you added example use-cases, so that we see what your feature actually allows to do that is impossible with existing infrastructure.

I think that the difference between kmerge( [I, J, K, L] ) and merge( merge(I, J), merge(K, L)) goes to the heart of the issue, since the latter might lose some of the benefits of the heap structure.

henselman-petrusek avatar May 18 '21 19:05 henselman-petrusek