chore: Add support for controlling the NettyTransport's byteBuf alloc…
…ator type. (#1707)
-
chore: Add support for controlling the NettyTransport's byteBuf allocator type.
-
chore: extract deriveByteBufAllocator method
(cherry picked from commit dbc9ed3a99b0d97a697b705ceb927c48d72eea90)
related thread: https://lists.apache.org/thread/6lj5lbg18hz4n9zckn0b89p9mv9nz14g
This is grey area for me. Whether a change like this be in a patch. If it is really useful then I'd support merging it. That said, I'd prefer if the Flink team test with a 1.2 snapshot and then if they confirm that they get the behaviour that they want then we can merge this.
I'm the type of person who puts more faith in opinions if they are backed up by work like in testing Flink with a Pekko snapshot that supports this config before we go to the effort of doing releases.
Example build with this change.
https://repository.apache.org/content/groups/snapshots/org/apache/pekko/pekko-actor_2.13/1.2.0-M0+55-a75bc7a7-SNAPSHOT/
@pjfanning I still think we should merge this and make sure we get it in the next 1.1.4 to help the Flink community.
refs: https://github.com/apache/flink/pull/26042 too
I am -1 on this. Let the Flink team approach us and convince us. I have seen noone say that this is a must for Flink.
@pjfanning A brief explanation of why this is generally needed: Flink uses Netty not only for RPC but also for data shuffling. Migrating to Netty 4 for RPC caused some tests to become flaky, primarily due to differences in memory allocation in the newer version. The safest option for us would be to set Pekko allocation to unpooled, as this would roughly replicate the memory requirements of the previous version and thereby reduce the probability of tipping containers at the memory limit into OOM in production. The problem is that we cannot apply this as a global setting, because other parts of the framework, such as the core data exchange network stack, also use Netty 4 and would consequently be affected by the change.
@afedulov have you tested with the pekko 1.2 snapshot jars that already support this?
@pjfanning I just created a local Pekko build (more details in https://github.com/apache/flink/pull/25955) from the 1.1.x branch with this PR applied on top and executed our failing test with setting the bytebuf-allocator-type to unpooled. With that setting, at least on my machine the failing test was successful from 5/5 runs.
So we believe this config option would be useful for us if you guys can get it into 1.1.4. From our side it is not something that is urgent, but it could be useful for use if in case there will be a 1.1.4 release, this change would be part of it.
This PR is not merged. You can only get this support by using the latest 1.2 snapshots.
See https://github.com/apache/pekko/pull/1709#issuecomment-2599698240
This PR is not merged. You can only get this support by using the latest 1.2 snapshots.
See https://github.com/apache/pekko/pull/1709#issuecomment-2599698240
And I built 1.1.x after I applied your commit from
Seems he did a local patch
Yes, I built Pekko myself locally after I applied this patch to 1.1.x locally, and built Flink against that version with the added setting to run the test.
@raboof @nvollmar What do you think? This looks not too bad for a patch - not ideal to have a non-bug fix in a patch but the change is small enough and doesn't have an API impact. The default behaviour remains unchanged and you have to opt in via a config change to see any behaviour change.
PR looks fine to me but I would prefer to have it released in a non patch update
@mdedetrich it's already in the main, here is to decide if we include it in 1.1.4
There is an option to do a 1.2.0-M1 release soon.
Then let's do a 1.20 release, we can backport strictly fixes from main to 1.1.x branch if needed.
I'm not sure if I can have all my features ready in 1.2.0, Chinese Spring Festival is coming. I still have some features in mind but not yet there.
I'm not sure if I can have all my features ready in 1.2.0, Chinese Spring Festival is coming. I still have some features in mind but not yet there.
I'm thinking 1.2.0-M1, 1.2.0-M2, maybe some more milestones and later this year, a full 1.2.0 release.
Flink team should be fine using 1.2.0-M1 in production if we don't add any extra changes to what is already on main branch.
I think pekko's M1 release should be very stable too, @afedulov @ferenc-csaky wdyt about it.
Does the 1.2 branch includes a Netty version update compared to 1.1.x? Cause if yes, it might be not as straightforward for Flink to jump to it.
Does the
1.2branch includes a Netty version update compared to1.1.x? Cause if yes, it might be not as straightforward for Flink to jump to it.
netty is a provided dependency of pekko-remote - so choose your own patch version of netty as long it is 4.1.x.
@pjfanning in that case it should be no problem on our end to wait for 1.2, thanks! 👍
@He-Pin @pjfanning @mdedetrich First of all, thank you for your support of the Flink project and for attempting to deliver this patch in a timely manner! At this stage, we decided to proceed with upgrading to Netty 4 in our own patch release, accepting the potential increase in memory usage. Without @He-Pin's patch, we are unable to replicate the Netty 3 behavior since the allocator setting is global, and Netty 4 (with its default settings) is already used in the performance-critical data exchange network stack.
Since we are already at the stage of creating release candidates, we cannot wait for the 1.2-M1 releases. However, it would still be valuable to have the 1.2-M1 release available as a fallback option in case we underestimated the impact of the upgrade to Netty 4 and need to provide a timely patch on our side.
@afedulov To be clear, I did not mean to bump Pekko for the currently in-flight patch releases, but for the next one against Flink 1.20. By that time, I am sure Pekko 1.2 will be released properly. :)
@pjfanning Do you think we should have this in 1.1.4? seems we need more time to publish 1.2.0
I don't think this should go into 1.1.4.
As Pekko 1.2.0 is out, so I can close this now. @ferenc-csaky ping, Pekko 1.2.0 is released.
@He-Pin Thank you for the reminder!
@ferenc-csaky, btw, will Flink move to Java 17+ later is there any timeline?