cats
cats copied to clipboard
Optimize traverse
This addresses the StackSafeMonad part of https://github.com/typelevel/cats/issues/4408. I thought that there were enough changes in this to merit a separate PR for the Defer part.
Open questions (will edit these out of the description later):
- The
TraverseFilterlaws hard-codeOption. How do we test that the new branching implementation is lawful?
The following benchmarks were run on an AWS t3.xlarge:
Baseline (b08196e8526f6025d009ac0c8efc0a0e0ebcb1c4)
| Benchmark | (length) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| TraverseBench.filterList | 10000 | thrpt | 20 | 3247.486 | ± 270.817 | ops/s |
| TraverseBench.filterVector | 10000 | thrpt | 20 | 3351.585 | ± 255.883 | ops/s |
| TraverseBench.mapList | 10000 | thrpt | 20 | 3198.905 | ± 235.865 | ops/s |
| TraverseBench.mapVector | 10000 | thrpt | 20 | 3906.339 | ± 225.508 | ops/s |
| TraverseBench.traverseChain | 10000 | thrpt | 20 | 556.706 | ± 31.814 | ops/s |
| TraverseBench.traverseChainError | 10000 | thrpt | 20 | 1278.954 | ± 74.242 | ops/s |
| TraverseBench.traverseFilterChain | 10000 | thrpt | 20 | 536.841 | ± 38.696 | ops/s |
| TraverseBench.traverseFilterList | 10000 | thrpt | 20 | 532.103 | ± 30.473 | ops/s |
| TraverseBench.traverseFilterVector | 10000 | thrpt | 20 | 590.312 | ± 29.534 | ops/s |
| TraverseBench.traverseList | 10000 | thrpt | 20 | 537.023 | ± 33.287 | ops/s |
| TraverseBench.traverseListError | 10000 | thrpt | 20 | 1674.930 | ± 48.771 | ops/s |
| TraverseBench.traverseVector | 10000 | thrpt | 20 | 581.844 | ± 32.074 | ops/s |
| TraverseBench.traverseVectorError | 10000 | thrpt | 20 | 1778.334 | ± 135.859 | ops/s |
| TraverseBench.traverse_Chain | 10000 | thrpt | 20 | 490.822 | ± 14.145 | ops/s |
| TraverseBench.traverse_List | 10000 | thrpt | 20 | 474.568 | ± 28.995 | ops/s |
| TraverseBench.traverse_Vector | 10000 | thrpt | 20 | 622.924 | ± 38.002 | ops/s |
Chain (2d5f4d7917804e921488f5f2e263e8349a6ceb42)
| Benchmark | (length) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| TraverseBench.filterList | 10000 | thrpt | 20 | 3170.974 | ± 146.628 | ops/s |
| TraverseBench.filterVector | 10000 | thrpt | 20 | 3217.881 | ± 181.087 | ops/s |
| TraverseBench.mapList | 10000 | thrpt | 20 | 3443.969 | ± 104.117 | ops/s |
| TraverseBench.mapVector | 10000 | thrpt | 20 | 3827.401 | ± 148.074 | ops/s |
| TraverseBench.traverseChain | 10000 | thrpt | 20 | 784.845 | ± 25.990 | ops/s |
| TraverseBench.traverseChainError | 10000 | thrpt | 20 | 1302.782 | ± 62.342 | ops/s |
| TraverseBench.traverseFilterChain | 10000 | thrpt | 20 | 748.986 | ± 62.151 | ops/s |
| TraverseBench.traverseFilterList | 10000 | thrpt | 20 | 669.817 | ± 41.798 | ops/s |
| TraverseBench.traverseFilterVector | 10000 | thrpt | 20 | 618.644 | ± 28.558 | ops/s |
| TraverseBench.traverseList | 10000 | thrpt | 20 | 583.413 | ± 27.079 | ops/s |
| TraverseBench.traverseListError | 10000 | thrpt | 20 | 1253.510 | ± 76.328 | ops/s |
| TraverseBench.traverseVector | 10000 | thrpt | 20 | 581.975 | ± 27.619 | ops/s |
| TraverseBench.traverseVectorError | 10000 | thrpt | 20 | 1302.362 | ± 66.015 | ops/s |
| TraverseBench.traverse_Chain | 10000 | thrpt | 20 | 3598.033 | ± 108.420 | ops/s |
| TraverseBench.traverse_List | 10000 | thrpt | 20 | 4048.287 | ± 158.090 | ops/s |
| TraverseBench.traverse_Vector | 10000 | thrpt | 20 | 4094.024 | ± 101.115 | ops/s |
Vector (702ab8b9d1ebad207bb7c97eb671da333e667ea0)
| Benchmark | (length) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| TraverseBench.filterList | 10000 | thrpt | 20 | 3305.299 | ± 233.483 | ops/s |
| TraverseBench.filterVector | 10000 | thrpt | 20 | 3253.531 | ± 133.993 | ops/s |
| TraverseBench.mapList | 10000 | thrpt | 20 | 3308.961 | ± 138.697 | ops/s |
| TraverseBench.mapVector | 10000 | thrpt | 20 | 3911.187 | ± 142.582 | ops/s |
| TraverseBench.traverseChain | 10000 | thrpt | 20 | 594.535 | ± 29.777 | ops/s |
| TraverseBench.traverseChainError | 10000 | thrpt | 20 | 1084.548 | ± 75.686 | ops/s |
| TraverseBench.traverseFilterChain | 10000 | thrpt | 20 | 642.895 | ± 55.621 | ops/s |
| TraverseBench.traverseFilterList | 10000 | thrpt | 20 | 682.965 | ± 41.234 | ops/s |
| TraverseBench.traverseFilterVector | 10000 | thrpt | 20 | 696.235 | ± 59.945 | ops/s |
| TraverseBench.traverseList | 10000 | thrpt | 20 | 582.275 | ± 35.877 | ops/s |
| TraverseBench.traverseListError | 10000 | thrpt | 20 | 1114.959 | ± 52.641 | ops/s |
| TraverseBench.traverseVector | 10000 | thrpt | 20 | 638.235 | ± 42.533 | ops/s |
| TraverseBench.traverseVectorError | 10000 | thrpt | 20 | 1098.038 | ± 62.946 | ops/s |
| TraverseBench.traverse_Chain | 10000 | thrpt | 20 | 3480.086 | ± 178.654 | ops/s |
| TraverseBench.traverse_List | 10000 | thrpt | 20 | 3950.650 | ± 171.441 | ops/s |
| TraverseBench.traverse_Vector | 10000 | thrpt | 20 | 3695.495 | ± 182.401 | ops/s |
- The
TraverseFilterlaws hard-codeOption. How do we test that the new branching implementation is lawful?
Can we add some new laws in terms of OptionT[Eval, _]? Edit: I don't think this will work unless we make sure that OptionT offers a StackSafeMonad if F is a StackSafeMonad.
Should I open a separate PR for the new benchmarks first?
Nah.
Can you post the results of the benchmarks? Apologies if I missed them.
Can you post the results of the benchmarks? Apologies if I missed them.
Yes will do. I had originally asked whether I should add the new ones in a separate PR first which was why I didn't run them straight away
@armanbilge @johnynek apologies it's taken me so long to get round to this. I've updated with benchmarks for the mutable builder approach. Are you still worried about lawfulness? Shall I try to do it with immutable data structures?
Those traverse_ improvements though 😆
Thanks for working on this. What a don't see is how a Monad such as fs2.Stream[IO, A] would interact here. In that. Case you can have laziness, concurrency and more than one output per input.
Can you give an explanation as to why we shouldn't worry about mutability in such a case?
Can you give an explanation as to why we shouldn't worry about mutability in such a case?
We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.
We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.
This is definitely a colloquial understanding of Monads, that they represent an sequential computation, but I don't see the law that allows us to make the claim you want.
For instance, in the current PR we are using immutable items inside the monadic type constructor. I think the current writing will work. But if we changed those to be mutable, I don't see a proof that it will work, even though it may.
For instance, consider this type:
sealed trait Tree[F[_], A]
case class Item[F[_], A](fa: A) extends Tree[F, A]
case class InsideF[F[_], A](treeF: F[Tree[F, A]]) extends Tree[F, A]
case class Branch[A](left: Tree[F, A], right: Tree[F, A]) extends Tree[F, A]
object Tree {
def pure[F[_], A](a: A): Tree[F, A] = Item(a)
def flatMap[F[_]: Functor, A, B](fa: Tree[F, A])(fn: A => Tree[F, B]): Tree[F, B] =
fa match {
case Item(a) => fn(a)
case InsideF(fa) => InsideF(fa.map(flatMap(_)(fn)))
case Branch(left, right) =>
Branch(flatMap(left)(fn), flatMap(right)(fn))
}
}
Now, I haven't checked if this follows the monad laws as we have expressed them, but I think it or something like it could. In this picture note we are recursively calling flatMap inside the F context and outside of it. So, if F is lazy, we return before we have fully recursed.
Couldn't something like this cause a problem?
I think saying "monads are sequential" is a bit too hand-wavey to bolt into the default implementation for every user of cats. I think we need a stronger argument than that.
Oscar is right: using mutable data structures to implement traverse will not work, but perhaps surprisingly the issue is not concurrency or laziness. I added a test in 2210d8271792a14c20b5b25b7b54b82f464779f6 that fails when tested against the builder-based implementation in a1c9ff5d02f744cc6789627877c98059e6b4b3a5. There is some cheating there to pretend that List is a stacksafe monad just so that we invoke the traverseDirectly codepath.
As it turns out, the problem is actually here: https://github.com/typelevel/cats/blob/a1c9ff5d02f744cc6789627877c98059e6b4b3a5/core/src/main/scala/cats/Traverse.scala#L299-L302
For List-like monads, you may invoke that map function multiple times for each element of the List. If the accumulator is shared mutable state then this breaks down very quickly, which is exactly what that test demonstrates.
Suffice to say: my bad 😅
Thanks @johnynek and @armanbilge I'm convinced that we should use immutable data structures 😆 I've run the benchmarks for the immutable vector variant. I'll do the same for immutable list and chain - I don't have a good intuition which would be best?
Also thanks for the other comments - I'll address those too :)
Apologies again for the delay on this. The blocker is that I keep forgetting to run the benchmarks overnight and I need my laptop during the day and it takes almost two hours to run.
I've still got to fix the Map instance but I'll do that once we've settled on an approach for the other types. Annoyingly I should probably re-do the Vector variant with the Applicative method changes so I still need to benchmark
Chain(c5d90a69657bd7a0cc0a654cfbcd798ed9f18274)List(f95b0876d205f71559ef05d92cbcec8c1a700c40)Vector(todo)
OK I've run the benchmarks for the Chain implementation and the traverse_ performance dropped significantly. But I didn't touch the implementation of traverse_... I've running these on an M2 Max. I've heard of benchmarks ending up running on its efficiency cores - could that be what happened?
Also realized that I didn't update TraverseFilter to use Chain 😭
I've running these on an M2 Max
Final warning — do not under any circumstances do this on Apple hardware.
https://www.youtube.com/watch?v=n5u7DgFwLGE&t=1541s
I've running these on an M2 Max
![]()
Final warning — do not under any circumstances do this on Apple hardware.
https://www.youtube.com/watch?v=n5u7DgFwLGE&t=1541s
😆 Sadly it's all I have. Any suggestions?
OK I've now got access to a relatively chonky EC2 instance so I'll try to get the full set of benchmarks done ASAP
OK there are the benchmarks for baseline, chain and vector implementations. The traverse_ improvements are looking pretty sweet but the traverse improvements are currently very small (especially when we have to convert between data types). I'll try to investigate in the new year but if anyone has ideas in the meantime then I'd love to hear them!
Am I wrong that the Chain version seems to beat the Vector version?
@armanbilge @johnynek @valencik sorry for letting this languish for so long 😭 I've now reverted the last change which was just to get some baseline benchmarks and merged main in again.
Just wanted to gauge people's thoughts on the best direction from here. Assuming that my benchmarks are reliable (if anyone else wants to run them as well, that would be fantastic!), the results are a bit disappointing for traverse (baseline, chain and vector all look relatively similar. The new versions are slightly better, especially for Chain collections with the Chain implementation and similarly for Vector) but pretty great for traverse_. Where do you think we should go from here?
I listed a couple of minor nits, but I think any performance improvement is worth merging, especially around traverse since it is such a key function.
Thanks! Do you think it's worth making separate Chain and Vector implementations respectively or just implement them all with Chain as that seems to be the best overall?
I would go where the benchmarks suggest if we have confidence in them.
@valencik you foolishly volunteered yourself on Discord to run the benchmarks as well 😆 If you have the chance, I would massively appreciate another set of data to confirm that what I'm seeing is correct
Mostly https://github.com/typelevel/cats/pull/4498/commits/803107de2f728f648800513ef5cff242a1172fbf * ( Sorry, I actually did apply suggested change in https://github.com/typelevel/cats/pull/4498#discussion_r1512844912 )
[info] Benchmark (length) Mode Cnt Score Error Units
[info] TraverseBench.filterList 10000 thrpt 20 5683.590 ± 73.700 ops/s
[info] TraverseBench.filterVector 10000 thrpt 20 5665.475 ± 34.436 ops/s
[info] TraverseBench.mapList 10000 thrpt 20 5833.001 ± 45.470 ops/s
[info] TraverseBench.mapVector 10000 thrpt 20 6206.558 ± 45.317 ops/s
[info] TraverseBench.traverseChain 10000 thrpt 20 1157.471 ± 8.626 ops/s
[info] TraverseBench.traverseChainError 10000 thrpt 20 2080.678 ± 15.341 ops/s
[info] TraverseBench.traverseFilterChain 10000 thrpt 20 1201.729 ± 6.121 ops/s
[info] TraverseBench.traverseFilterList 10000 thrpt 20 1143.251 ± 8.666 ops/s
[info] TraverseBench.traverseFilterVector 10000 thrpt 20 1232.990 ± 7.090 ops/s
[info] TraverseBench.traverseList 10000 thrpt 20 1085.565 ± 8.561 ops/s
[info] TraverseBench.traverseListError 10000 thrpt 20 2192.903 ± 9.450 ops/s
[info] TraverseBench.traverseVector 10000 thrpt 20 1162.033 ± 22.101 ops/s
[info] TraverseBench.traverseVectorError 10000 thrpt 20 2178.812 ± 22.956 ops/s
[info] TraverseBench.traverse_Chain 10000 thrpt 20 7644.606 ± 151.222 ops/s
[info] TraverseBench.traverse_List 10000 thrpt 20 8946.374 ± 124.209 ops/s
[info] TraverseBench.traverse_Vector 10000 thrpt 20 9036.810 ± 37.314 ops/s
and against main (https://github.com/typelevel/cats/commit/d323facbfcfac7803ed85f75e16e302f68c4ad45) but with TraverseBench.scala from https://github.com/typelevel/cats/pull/4498/commits/803107de2f728f648800513ef5cff242a1172fbf:
[info] Benchmark (length) Mode Cnt Score Error Units
[info] TraverseBench.filterList 10000 thrpt 20 5781.907 ± 31.753 ops/s
[info] TraverseBench.filterVector 10000 thrpt 20 5667.024 ± 53.352 ops/s
[info] TraverseBench.mapList 10000 thrpt 20 5864.307 ± 45.807 ops/s
[info] TraverseBench.mapVector 10000 thrpt 20 6245.999 ± 46.899 ops/s
[info] TraverseBench.traverseChain 10000 thrpt 20 1072.360 ± 25.156 ops/s
[info] TraverseBench.traverseChainError 10000 thrpt 20 2716.055 ± 100.545 ops/s
[info] TraverseBench.traverseFilterChain 10000 thrpt 20 1029.492 ± 51.776 ops/s
[info] TraverseBench.traverseFilterList 10000 thrpt 20 928.622 ± 3.589 ops/s
[info] TraverseBench.traverseFilterVector 10000 thrpt 20 1002.081 ± 44.357 ops/s
[info] TraverseBench.traverseList 10000 thrpt 20 948.305 ± 26.980 ops/s
[info] TraverseBench.traverseListError 10000 thrpt 20 2722.420 ± 77.210 ops/s
[info] TraverseBench.traverseVector 10000 thrpt 20 1011.813 ± 19.254 ops/s
[info] TraverseBench.traverseVectorError 10000 thrpt 20 3366.467 ± 35.241 ops/s
[info] TraverseBench.traverse_Chain 10000 thrpt 20 876.282 ± 100.385 ops/s
[info] TraverseBench.traverse_List 10000 thrpt 20 779.133 ± 14.965 ops/s
[info] TraverseBench.traverse_Vector 10000 thrpt 20 1093.288 ± 45.168 ops/s
By the looks of things filterList, filterVector, mapList, and mapVector are all within errror when comparing this PR against latest main.
We have small improvements for traverseChain, traverseFilterChain, traverseFilterList, traverseFilterVector, traverseList, and traverseVector.
Medium regressions for traverseChainError, traverseListError, traverseVectorError.
And then very large improvements for traverse_Chain, traverse_List, and traverse_Vector.
Summarized differently:
🏎️ traverse_
🏃 traverse
🏃 traverseFilter
🆗 filter
🆗 map
🐢 traverseError
Wanted to confirm things, so I ran again and came up with the same results.
I ran these benchmarks (both previously and here) with the following in sbt:
bench/jmh:run -i 10 -wi 10 -f 2 -t 1 cats.bench.TraverseBench
And with the following jvmopts: -Xms2G -Xmx12G
Details
on exactly https://github.com/typelevel/cats/pull/4498/commits/803107de2f728f648800513ef5cff242a1172fbf
[info] Benchmark (length) Mode Cnt Score Error Units
[info] TraverseBench.filterList 10000 thrpt 20 5715.626 ± 29.321 ops/s
[info] TraverseBench.filterVector 10000 thrpt 20 5546.558 ± 19.276 ops/s
[info] TraverseBench.mapList 10000 thrpt 20 5767.038 ± 34.619 ops/s
[info] TraverseBench.mapVector 10000 thrpt 20 6173.065 ± 40.277 ops/s
[info] TraverseBench.traverseChain 10000 thrpt 20 1123.427 ± 4.987 ops/s
[info] TraverseBench.traverseChainError 10000 thrpt 20 2127.662 ± 13.144 ops/s
[info] TraverseBench.traverseFilterChain 10000 thrpt 20 1180.463 ± 5.585 ops/s
[info] TraverseBench.traverseFilterList 10000 thrpt 20 1206.215 ± 9.615 ops/s
[info] TraverseBench.traverseFilterVector 10000 thrpt 20 1244.894 ± 25.955 ops/s
[info] TraverseBench.traverseList 10000 thrpt 20 1069.802 ± 13.869 ops/s
[info] TraverseBench.traverseListError 10000 thrpt 20 2167.488 ± 21.981 ops/s
[info] TraverseBench.traverseVector 10000 thrpt 20 1152.973 ± 25.078 ops/s
[info] TraverseBench.traverseVectorError 10000 thrpt 20 2207.517 ± 33.643 ops/s
[info] TraverseBench.traverse_Chain 10000 thrpt 20 7912.984 ± 107.098 ops/s
[info] TraverseBench.traverse_List 10000 thrpt 20 8934.534 ± 46.916 ops/s
[info] TraverseBench.traverse_Vector 10000 thrpt 20 9092.209 ± 36.689 ops/s
and on main:
[info] Benchmark (length) Mode Cnt Score Error Units
[info] TraverseBench.filterList 10000 thrpt 20 5710.935 ± 65.052 ops/s
[info] TraverseBench.filterVector 10000 thrpt 20 5586.684 ± 53.060 ops/s
[info] TraverseBench.mapList 10000 thrpt 20 5794.559 ± 31.269 ops/s
[info] TraverseBench.mapVector 10000 thrpt 20 6156.259 ± 29.844 ops/s
[info] TraverseBench.traverseChain 10000 thrpt 20 1092.330 ± 7.171 ops/s
[info] TraverseBench.traverseChainError 10000 thrpt 20 2713.055 ± 41.359 ops/s
[info] TraverseBench.traverseFilterChain 10000 thrpt 20 982.057 ± 6.179 ops/s
[info] TraverseBench.traverseFilterList 10000 thrpt 20 873.508 ± 11.078 ops/s
[info] TraverseBench.traverseFilterVector 10000 thrpt 20 986.796 ± 41.753 ops/s
[info] TraverseBench.traverseList 10000 thrpt 20 918.397 ± 39.014 ops/s
[info] TraverseBench.traverseListError 10000 thrpt 20 2781.320 ± 15.372 ops/s
[info] TraverseBench.traverseVector 10000 thrpt 20 997.688 ± 4.798 ops/s
[info] TraverseBench.traverseVectorError 10000 thrpt 20 3081.201 ± 31.311 ops/s
[info] TraverseBench.traverse_Chain 10000 thrpt 20 939.052 ± 20.897 ops/s
[info] TraverseBench.traverse_List 10000 thrpt 20 808.780 ± 8.011 ops/s
[info] TraverseBench.traverse_Vector 10000 thrpt 20 1111.320 ± 9.188 ops/s
@valencik absolute legend ❤️ Couple of points:
- I'm dumb and not thinking straight 😅 I was thinking the
Mapimplementation would be dependent on eitherChainorVectorand so I was waiting for the outcome of that debate. I'll fix it now mapandfiltershould be unchanged by this so all of the results apart fromtraverseErrorare expected, right? It's certainly not obvious to me whytraverseErroris worse now...- Do you think it's worth having separate implementations for
VectorandChain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other
@valencik sorry to bother you. Just tagging you in case you missed my previous comment
2.
mapandfiltershould be unchanged by this so all of the results apart fromtraverseErrorare expected, right? It's certainly not obvious to me whytraverseErroris worse now...
I agree, and similarly, I also do not know why traverseError would be worse now.
3. Do you think it's worth having separate implementations for
VectorandChain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other
This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of https://github.com/typelevel/cats/issues/4408, we can have incremental improvements follow it.
mapandfiltershould be unchanged by this so all of the results apart fromtraverseErrorare expected, right? It's certainly not obvious to me whytraverseErroris worse now...I agree, and similarly, I also do not know why
traverseErrorwould be worse now.
- Do you think it's worth having separate implementations for
VectorandChain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the otherThis almost seems like followup work to me? I think, if we are fine with the
traverseErrorregression, we should merge this work (or fixtraverseErrorand then merge). This already is only one part of #4408, we can have incremental improvements follow it.
Yeah I'm more than happy to draw a line under this PR and follow up separately - it's been open far too long 😆 I'll move it out of draft now
Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?
Argh @valencik I've just realized the current HEAD uses
Vectorrather thanChain. From what I saw,Chainseemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?
Just to double check I understand, do you mean the implementation of traverseDirectly?
Yeah, let's switch that to Chain and I can rerun benchmarks.
Currently running benchmarks again on a tweak which uses Chain in traverseDirectly and traverseFilterDirectly
https://github.com/typelevel/cats/compare/803107de2f728f648800513ef5cff242a1172fbf...valencik:cats:more-chain?expand=1
Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence
https://github.com/TimWSpence/cats/pull/2
Feel free to leave it for a day or two. There's a ton of numbers here. I wouldn't be against taking another look over things.
I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements.
But again, maybe we should save that for a follow up.