cats Optimize traverse

This addresses the StackSafeMonad part of https://github.com/typelevel/cats/issues/4408. I thought that there were enough changes in this to merit a separate PR for the Defer part.

Open questions (will edit these out of the description later):

The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

The following benchmarks were run on an AWS t3.xlarge:

Baseline (b08196e8526f6025d009ac0c8efc0a0e0ebcb1c4)

Benchmark	(length)	Mode	Cnt	Score	Error	Units
TraverseBench.filterList	10000	thrpt	20	3247.486	± 270.817	ops/s
TraverseBench.filterVector	10000	thrpt	20	3351.585	± 255.883	ops/s
TraverseBench.mapList	10000	thrpt	20	3198.905	± 235.865	ops/s
TraverseBench.mapVector	10000	thrpt	20	3906.339	± 225.508	ops/s
TraverseBench.traverseChain	10000	thrpt	20	556.706	± 31.814	ops/s
TraverseBench.traverseChainError	10000	thrpt	20	1278.954	± 74.242	ops/s
TraverseBench.traverseFilterChain	10000	thrpt	20	536.841	± 38.696	ops/s
TraverseBench.traverseFilterList	10000	thrpt	20	532.103	± 30.473	ops/s
TraverseBench.traverseFilterVector	10000	thrpt	20	590.312	± 29.534	ops/s
TraverseBench.traverseList	10000	thrpt	20	537.023	± 33.287	ops/s
TraverseBench.traverseListError	10000	thrpt	20	1674.930	± 48.771	ops/s
TraverseBench.traverseVector	10000	thrpt	20	581.844	± 32.074	ops/s
TraverseBench.traverseVectorError	10000	thrpt	20	1778.334	± 135.859	ops/s
TraverseBench.traverse_Chain	10000	thrpt	20	490.822	± 14.145	ops/s
TraverseBench.traverse_List	10000	thrpt	20	474.568	± 28.995	ops/s
TraverseBench.traverse_Vector	10000	thrpt	20	622.924	± 38.002	ops/s

Chain (2d5f4d7917804e921488f5f2e263e8349a6ceb42)

Benchmark	(length)	Mode	Cnt	Score	Error	Units
TraverseBench.filterList	10000	thrpt	20	3170.974	± 146.628	ops/s
TraverseBench.filterVector	10000	thrpt	20	3217.881	± 181.087	ops/s
TraverseBench.mapList	10000	thrpt	20	3443.969	± 104.117	ops/s
TraverseBench.mapVector	10000	thrpt	20	3827.401	± 148.074	ops/s
TraverseBench.traverseChain	10000	thrpt	20	784.845	± 25.990	ops/s
TraverseBench.traverseChainError	10000	thrpt	20	1302.782	± 62.342	ops/s
TraverseBench.traverseFilterChain	10000	thrpt	20	748.986	± 62.151	ops/s
TraverseBench.traverseFilterList	10000	thrpt	20	669.817	± 41.798	ops/s
TraverseBench.traverseFilterVector	10000	thrpt	20	618.644	± 28.558	ops/s
TraverseBench.traverseList	10000	thrpt	20	583.413	± 27.079	ops/s
TraverseBench.traverseListError	10000	thrpt	20	1253.510	± 76.328	ops/s
TraverseBench.traverseVector	10000	thrpt	20	581.975	± 27.619	ops/s
TraverseBench.traverseVectorError	10000	thrpt	20	1302.362	± 66.015	ops/s
TraverseBench.traverse_Chain	10000	thrpt	20	3598.033	± 108.420	ops/s
TraverseBench.traverse_List	10000	thrpt	20	4048.287	± 158.090	ops/s
TraverseBench.traverse_Vector	10000	thrpt	20	4094.024	± 101.115	ops/s

Vector (702ab8b9d1ebad207bb7c97eb671da333e667ea0)

Benchmark	(length)	Mode	Cnt	Score	Error	Units
TraverseBench.filterList	10000	thrpt	20	3305.299	± 233.483	ops/s
TraverseBench.filterVector	10000	thrpt	20	3253.531	± 133.993	ops/s
TraverseBench.mapList	10000	thrpt	20	3308.961	± 138.697	ops/s
TraverseBench.mapVector	10000	thrpt	20	3911.187	± 142.582	ops/s
TraverseBench.traverseChain	10000	thrpt	20	594.535	± 29.777	ops/s
TraverseBench.traverseChainError	10000	thrpt	20	1084.548	± 75.686	ops/s
TraverseBench.traverseFilterChain	10000	thrpt	20	642.895	± 55.621	ops/s
TraverseBench.traverseFilterList	10000	thrpt	20	682.965	± 41.234	ops/s
TraverseBench.traverseFilterVector	10000	thrpt	20	696.235	± 59.945	ops/s
TraverseBench.traverseList	10000	thrpt	20	582.275	± 35.877	ops/s
TraverseBench.traverseListError	10000	thrpt	20	1114.959	± 52.641	ops/s
TraverseBench.traverseVector	10000	thrpt	20	638.235	± 42.533	ops/s
TraverseBench.traverseVectorError	10000	thrpt	20	1098.038	± 62.946	ops/s
TraverseBench.traverse_Chain	10000	thrpt	20	3480.086	± 178.654	ops/s
TraverseBench.traverse_List	10000	thrpt	20	3950.650	± 171.441	ops/s
TraverseBench.traverse_Vector	10000	thrpt	20	3695.495	± 182.401	ops/s

Aug 23 '23 16:08 TimWSpence

The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

Can we add some new laws in terms of OptionT[Eval, _]? Edit: I don't think this will work unless we make sure that OptionT offers a StackSafeMonad if F is a StackSafeMonad.

Should I open a separate PR for the new benchmarks first?

Nah.

Aug 23 '23 16:08 armanbilge

Can you post the results of the benchmarks? Apologies if I missed them.

Aug 23 '23 16:08 johnynek

Can you post the results of the benchmarks? Apologies if I missed them.

Yes will do. I had originally asked whether I should add the new ones in a separate PR first which was why I didn't run them straight away

Aug 23 '23 18:08 TimWSpence

@armanbilge @johnynek apologies it's taken me so long to get round to this. I've updated with benchmarks for the mutable builder approach. Are you still worried about lawfulness? Shall I try to do it with immutable data structures?

Those traverse_ improvements though 😆

Nov 05 '23 16:11 TimWSpence

Thanks for working on this. What a don't see is how a Monad such as fs2.Stream[IO, A] would interact here. In that. Case you can have laziness, concurrency and more than one output per input.

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

Nov 08 '23 22:11 johnynek

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

Nov 08 '23 23:11 armanbilge

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

This is definitely a colloquial understanding of Monads, that they represent an sequential computation, but I don't see the law that allows us to make the claim you want.

For instance, in the current PR we are using immutable items inside the monadic type constructor. I think the current writing will work. But if we changed those to be mutable, I don't see a proof that it will work, even though it may.

For instance, consider this type:

sealed trait Tree[F[_], A]
case class Item[F[_], A](fa: A) extends Tree[F, A]
case class InsideF[F[_], A](treeF: F[Tree[F, A]]) extends Tree[F, A]
case class Branch[A](left: Tree[F, A], right: Tree[F, A]) extends Tree[F, A]

object Tree {
  def pure[F[_], A](a: A): Tree[F, A] = Item(a)
  def flatMap[F[_]: Functor, A, B](fa: Tree[F, A])(fn: A => Tree[F, B]): Tree[F, B] =
    fa match {
      case Item(a) => fn(a)
      case InsideF(fa) => InsideF(fa.map(flatMap(_)(fn)))
      case Branch(left, right) =>
        Branch(flatMap(left)(fn), flatMap(right)(fn))
    }
}

Now, I haven't checked if this follows the monad laws as we have expressed them, but I think it or something like it could. In this picture note we are recursively calling flatMap inside the F context and outside of it. So, if F is lazy, we return before we have fully recursed.

Couldn't something like this cause a problem?

I think saying "monads are sequential" is a bit too hand-wavey to bolt into the default implementation for every user of cats. I think we need a stronger argument than that.

Nov 09 '23 01:11 johnynek

Oscar is right: using mutable data structures to implement traverse will not work, but perhaps surprisingly the issue is not concurrency or laziness. I added a test in 2210d8271792a14c20b5b25b7b54b82f464779f6 that fails when tested against the builder-based implementation in a1c9ff5d02f744cc6789627877c98059e6b4b3a5. There is some cheating there to pretend that List is a stacksafe monad just so that we invoke the traverseDirectly codepath.

As it turns out, the problem is actually here: https://github.com/typelevel/cats/blob/a1c9ff5d02f744cc6789627877c98059e6b4b3a5/core/src/main/scala/cats/Traverse.scala#L299-L302

For List-like monads, you may invoke that map function multiple times for each element of the List. If the accumulator is shared mutable state then this breaks down very quickly, which is exactly what that test demonstrates.

Suffice to say: my bad 😅

Nov 10 '23 21:11 armanbilge

Thanks @johnynek and @armanbilge I'm convinced that we should use immutable data structures 😆 I've run the benchmarks for the immutable vector variant. I'll do the same for immutable list and chain - I don't have a good intuition which would be best?

Also thanks for the other comments - I'll address those too :)

Nov 14 '23 12:11 TimWSpence

Apologies again for the delay on this. The blocker is that I keep forgetting to run the benchmarks overnight and I need my laptop during the day and it takes almost two hours to run.

I've still got to fix the Map instance but I'll do that once we've settled on an approach for the other types. Annoyingly I should probably re-do the Vector variant with the Applicative method changes so I still need to benchmark

Chain (c5d90a69657bd7a0cc0a654cfbcd798ed9f18274)
List (f95b0876d205f71559ef05d92cbcec8c1a700c40)
Vector (todo)

Dec 08 '23 15:12 TimWSpence

OK I've run the benchmarks for the Chain implementation and the traverse_ performance dropped significantly. But I didn't touch the implementation of traverse_... I've running these on an M2 Max. I've heard of benchmarks ending up running on its efficiency cores - could that be what happened?

Also realized that I didn't update TraverseFilter to use Chain 😭

Dec 13 '23 14:12 TimWSpence

I've running these on an M2 Max

Final warning — do not under any circumstances do this on Apple hardware.

https://www.youtube.com/watch?v=n5u7DgFwLGE&t=1541s

Dec 13 '23 16:12 armanbilge

I've running these on an M2 Max

Final warning — do not under any circumstances do this on Apple hardware.

https://www.youtube.com/watch?v=n5u7DgFwLGE&t=1541s

😆 Sadly it's all I have. Any suggestions?

Dec 14 '23 16:12 TimWSpence

OK I've now got access to a relatively chonky EC2 instance so I'll try to get the full set of benchmarks done ASAP

Dec 18 '23 14:12 TimWSpence

OK there are the benchmarks for baseline, chain and vector implementations. The traverse_ improvements are looking pretty sweet but the traverse improvements are currently very small (especially when we have to convert between data types). I'll try to investigate in the new year but if anyone has ideas in the meantime then I'd love to hear them!

Dec 22 '23 16:12 TimWSpence

Am I wrong that the Chain version seems to beat the Vector version?

Dec 24 '23 19:12 johnynek

@armanbilge @johnynek @valencik sorry for letting this languish for so long 😭 I've now reverted the last change which was just to get some baseline benchmarks and merged main in again.

Just wanted to gauge people's thoughts on the best direction from here. Assuming that my benchmarks are reliable (if anyone else wants to run them as well, that would be fantastic!), the results are a bit disappointing for traverse (baseline, chain and vector all look relatively similar. The new versions are slightly better, especially for Chain collections with the Chain implementation and similarly for Vector) but pretty great for traverse_. Where do you think we should go from here?

Mar 04 '24 15:03 TimWSpence

I listed a couple of minor nits, but I think any performance improvement is worth merging, especially around traverse since it is such a key function.

Thanks! Do you think it's worth making separate Chain and Vector implementations respectively or just implement them all with Chain as that seems to be the best overall?

Mar 04 '24 22:03 TimWSpence

I would go where the benchmarks suggest if we have confidence in them.

Mar 04 '24 23:03 johnynek

@valencik you foolishly volunteered yourself on Discord to run the benchmarks as well 😆 If you have the chance, I would massively appreciate another set of data to confirm that what I'm seeing is correct

Mar 05 '24 12:03 TimWSpence

Mostly https://github.com/typelevel/cats/pull/4498/commits/803107de2f728f648800513ef5cff242a1172fbf * ( Sorry, I actually did apply suggested change in https://github.com/typelevel/cats/pull/4498#discussion_r1512844912 )

[info] Benchmark                           (length)   Mode  Cnt     Score     Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5683.590 ±  73.700  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5665.475 ±  34.436  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5833.001 ±  45.470  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6206.558 ±  45.317  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1157.471 ±   8.626  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2080.678 ±  15.341  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1201.729 ±   6.121  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20  1143.251 ±   8.666  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1232.990 ±   7.090  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20  1085.565 ±   8.561  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2192.903 ±   9.450  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1162.033 ±  22.101  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  2178.812 ±  22.956  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20  7644.606 ± 151.222  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20  8946.374 ± 124.209  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  9036.810 ±  37.314  ops/s

and against main (https://github.com/typelevel/cats/commit/d323facbfcfac7803ed85f75e16e302f68c4ad45) but with TraverseBench.scala from https://github.com/typelevel/cats/pull/4498/commits/803107de2f728f648800513ef5cff242a1172fbf:

[info] Benchmark                           (length)   Mode  Cnt     Score     Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5781.907 ±  31.753  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5667.024 ±  53.352  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5864.307 ±  45.807  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6245.999 ±  46.899  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1072.360 ±  25.156  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2716.055 ± 100.545  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1029.492 ±  51.776  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20   928.622 ±   3.589  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1002.081 ±  44.357  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20   948.305 ±  26.980  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2722.420 ±  77.210  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1011.813 ±  19.254  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  3366.467 ±  35.241  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20   876.282 ± 100.385  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20   779.133 ±  14.965  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  1093.288 ±  45.168  ops/s

By the looks of things filterList, filterVector, mapList, and mapVector are all within errror when comparing this PR against latest main. We have small improvements for traverseChain, traverseFilterChain, traverseFilterList, traverseFilterVector, traverseList, and traverseVector. Medium regressions for traverseChainError, traverseListError, traverseVectorError. And then very large improvements for traverse_Chain, traverse_List, and traverse_Vector.

Summarized differently:

🏎️ traverse_ 🏃 traverse 🏃 traverseFilter 🆗 filter 🆗 map 🐢 traverseError

Mar 06 '24 13:03 valencik

Wanted to confirm things, so I ran again and came up with the same results.

I ran these benchmarks (both previously and here) with the following in sbt: bench/jmh:run -i 10 -wi 10 -f 2 -t 1 cats.bench.TraverseBench

And with the following jvmopts: -Xms2G -Xmx12G

Details

on exactly https://github.com/typelevel/cats/pull/4498/commits/803107de2f728f648800513ef5cff242a1172fbf

[info] Benchmark                           (length)   Mode  Cnt     Score    Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5715.626 ±  29.321  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5546.558 ±  19.276  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5767.038 ±  34.619  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6173.065 ±  40.277  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1123.427 ±   4.987  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2127.662 ±  13.144  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1180.463 ±   5.585  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20  1206.215 ±   9.615  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1244.894 ±  25.955  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20  1069.802 ±  13.869  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2167.488 ±  21.981  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1152.973 ±  25.078  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  2207.517 ±  33.643  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20  7912.984 ± 107.098  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20  8934.534 ±  46.916  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  9092.209 ±  36.689  ops/s

and on main:

[info] Benchmark                           (length)   Mode  Cnt     Score    Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5710.935 ± 65.052  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5586.684 ± 53.060  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5794.559 ± 31.269  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6156.259 ± 29.844  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1092.330 ±  7.171  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2713.055 ± 41.359  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20   982.057 ±  6.179  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20   873.508 ± 11.078  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20   986.796 ± 41.753  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20   918.397 ± 39.014  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2781.320 ± 15.372  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20   997.688 ±  4.798  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  3081.201 ± 31.311  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20   939.052 ± 20.897  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20   808.780 ±  8.011  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  1111.320 ±  9.188  ops/s

Mar 06 '24 23:03 valencik

@valencik absolute legend ❤️ Couple of points:

I'm dumb and not thinking straight 😅 I was thinking the Map implementation would be dependent on either Chain or Vector and so I was waiting for the outcome of that debate. I'll fix it now
map and filter should be unchanged by this so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...
Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other

Mar 07 '24 17:03 TimWSpence

@valencik sorry to bother you. Just tagging you in case you missed my previous comment

Mar 18 '24 10:03 TimWSpence

2. map and filter should be unchanged by this so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of https://github.com/typelevel/cats/issues/4408, we can have incremental improvements follow it.

Mar 20 '24 23:03 valencik

map and filter should be unchanged by this so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

Yeah I'm more than happy to draw a line under this PR and follow up separately - it's been open far too long 😆 I'll move it out of draft now

Mar 21 '24 16:03 TimWSpence

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

Mar 22 '24 10:03 TimWSpence

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

Just to double check I understand, do you mean the implementation of traverseDirectly?

Yeah, let's switch that to Chain and I can rerun benchmarks.

Mar 22 '24 12:03 valencik

Currently running benchmarks again on a tweak which uses Chain in traverseDirectly and traverseFilterDirectly https://github.com/typelevel/cats/compare/803107de2f728f648800513ef5cff242a1172fbf...valencik:cats:more-chain?expand=1

Mar 22 '24 13:03 valencik

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence https://github.com/TimWSpence/cats/pull/2

Feel free to leave it for a day or two. There's a ton of numbers here. I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements. But again, maybe we should save that for a follow up.

Mar 22 '24 23:03 valencik

cats cats copied to clipboard

Optimize traverse

Baseline (b08196e8526f6025d009ac0c8efc0a0e0ebcb1c4)

Chain (2d5f4d7917804e921488f5f2e263e8349a6ceb42)

Vector (702ab8b9d1ebad207bb7c97eb671da333e667ea0)

cats
cats copied to clipboard