itertools icon indicating copy to clipboard operation
itertools copied to clipboard

More generic maps (in `GroupingMap`)

Open Philippe-Cholet opened this issue 1 year ago • 7 comments

At the moment, this is only a proof of concept, but it would close #588 (and eventually solve some issues, see https://github.com/rust-itertools/itertools/labels/generic-container).

  • I wrote a straightforward private trait generalizing both BTreeMap and HashMap (with any hasher). Maybe too much methods, or not enough. I will adjust it. I don't think we should generalize to more than those two maps but I guess it's possible.
  • Changed use_std to use_alloc where possible (because BTreeMap only needs use_alloc while HashMap needs use_std) and relaxed some Eq + Hash bounds.
  • I added a alternative aggregate_in to aggregate which takes one more argument: any empty map (or one we would .clear()).

The last point is where I struggled the most but that's pretty much the only real possibility because there is so many possibilities of initializations: BTreeMap::new(), HashMap::new(), HashMap::default() for some hashers, and with_capacity shenanigans. And I think it's the most general solution (I can even imagine one single HashMap being used multiple times to avoid reallocations).

This could extend to other areas than just GroupingMap, and a Set trait could be wrote to generalize BTreeSet and HashSet but it's for another PR.

For now, I would like some feedback.

CI currently fails because of temporary unused imports.

Philippe-Cholet avatar Mar 12 '24 10:03 Philippe-Cholet

Currently, we can do it.into_grouping_map().product(); but we can't have a generic map.

let my_map = HashMap::new();
let my_map = HashMap::with_capacity(100);
let my_map = FxHashMap::default(); // rustc_hash
let mut my_map = FxHashMap::default(); my_map.reserve(100);
let my_map = BTreeMap::new();

But where to give the map? (without breaking change)

1. Initial idea: suffix all GroupingMap methods with _in. But it results in many weird names like minmax_by_key_in.

it.into_grouping_map().product_in(my_map);
//                            --- ------

For each GroupingMap method like product, old it.into_grouping_map().product(); would delegate to new it.into_grouping_map().product_in(HashMap::new());.

2. Suffix Itertools methods. The trait would be a bit more crowded. But only two redirections.

it.into_grouping_map_in(my_map).product();
//                  --- ------

Old it.into_grouping_map(); would delegate to new it.into_grouping_map_in(HashMap::new());. Similar for into_grouping_map_by -> into_grouping_map_by_in but that's it.

That would be similar to other possible extensions like Itertools::counts_in and Itertools::all_unique_in.

3. A bit nested. But no _in suffix.

it.into_grouping_map().with_map(my_map).product();
//                    -----------------

For each GroupingMap method like product, old it.into_grouping_map().product(); would delegate to new it.into_grouping_map().with_map(HashMap::new()).product();.

Philippe-Cholet avatar Mar 13 '24 09:03 Philippe-Cholet

Hi @Philippe-Cholet, two thoughts:

  • Does into_grouping_map play nicely with type inference? (Currently, GroupingMap returns maps with different value types, e.g. minmax returns MinMaxResult<V>s, but max returns Vs.) If so, I would go with it (offers a central place to supply a map and generalizes well to counts et al).
  • I know that for many day-to-day cases, a plain old (possibly sorted) Vec<(K, V)> can be more efficient than {BTree|Hash}Map. Thus, I think we should ensure that trait Map could support this type.

phimuemue avatar Mar 13 '24 11:03 phimuemue

@phimuemue I see your performance point about Vec<(K, V)>. It can always be added later (in the next PR, or any release) as we would not break anything by implementing Map for a new type.

Well, silly me...!! I don't know how it could work in a central place (cases 2 and 3 in my last comment). We would create a new struct GroupingMapIn that GroupingMap would alias, right? pub type GroupingMap<I> = GroupingMapIn<I, HashMap<_, _>>;... Simply blocked! The first one would be <I as Iterator>::Item.0 (which I can't write) and the second is still unknown since it's the method using it that will determine it... Then only remain my initial idea (case 1) method_name_in<..., M: Map<Key = ..., Value = ...>>(..., map: M) -> M.

Philippe-Cholet avatar Mar 13 '24 13:03 Philippe-Cholet

Codecov Report

Attention: Patch coverage is 95.34368% with 21 lines in your changes are missing coverage. Please review.

Project coverage is 94.43%. Comparing base (6814180) to head (5a71e63). Report is 40 commits behind head on master.

Files Patch % Lines
src/generic_containers.rs 54.54% 20 Missing :warning:
src/grouping_map.rs 99.75% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #901      +/-   ##
==========================================
+ Coverage   94.38%   94.43%   +0.04%     
==========================================
  Files          48       49       +1     
  Lines        6665     7187     +522     
==========================================
+ Hits         6291     6787     +496     
- Misses        374      400      +26     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 15 '24 11:03 codecov[bot]

@SkiFire13 I see you are the author of GroupingMap. What do you think of such extension?

Philippe-Cholet avatar Mar 19 '24 09:03 Philippe-Cholet

@SkiFire13

The duplication of the methods is a bit unfortunate though.

I totally agree, I'm gonna investigate your "add two generics" idea.

Could this be extended to into_group_map[_by] too?

Yes, in the next PR. I intend to extend all HashMap usage in the crate (and HashSet later), most https://github.com/rust-itertools/itertools/labels/generic-container issues should be solved with this Map.

Because I want to make all this work first, I'm gonna delay two of your ideas:

  • Optimize remove+insert in GroupingMap::aggregate_in.
  • impl<M: Map> Map for &mut M.

Philippe-Cholet avatar Mar 19 '24 11:03 Philippe-Cholet

Because I want to make all this work first, I'm gonna delay two of your ideas:

* Optimize `remove+insert` in `GroupingMap::aggregate_in`.

* `impl<M: Map> Map for &mut M`.

Understandable. As long as Map is private this can be changed later, so I'm fine with it.

SkiFire13 avatar Mar 19 '24 12:03 SkiFire13