incubator-heron
Revamp RoundRobinPacking default container size determination
Changes:
- When the instance resource requirement map is only partially set, instances without a specific resource allocation will no longer split the remaining container resource among themselves. Instead, they will be allocated the default instance resource (1 core, 1 GB RAM)
- When the container RAM/CPU constraints are NOT specified, the default values will be derived from the largest container
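To make the first change concrete, here is a minimal sketch of a partially set resource map using Heron's Java `Config` API (the component names are hypothetical, and the `setComponentCpu`/`setComponentRam`/`setContainer*Requested` method names should be verified against your Heron version):

```java
import org.apache.heron.api.Config;
import org.apache.heron.common.basics.ByteAmount;

public class PartialResourceConfig {
  public static Config build() {
    Config conf = new Config();

    // Explicit allocation for "spout" only; "bolt" is deliberately left out
    // of the resource maps, so under this change its instances receive the
    // default instance resource (1 core, 1 GB RAM) instead of splitting the
    // leftover container resource.
    conf.setComponentCpu("spout", 0.5);
    conf.setComponentRam("spout", ByteAmount.fromMegabytes(512));

    // Optional container-level ceilings; when these are unset, the new
    // behavior derives the defaults from the largest container.
    conf.setContainerCpuRequested(2);
    conf.setContainerRamRequested(ByteAmount.fromGigabytes(4));
    return conf;
  }
}
```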
@xiaoyao1991 Does this mean that if I created a simple topology with, let's say, 3 components with a parallelism of 2 each, without explicitly specifying resources, the topology would default to requiring 6 cores to run?
@joshfischer1108 Yes: 3 components × parallelism 2 = 6 instances, each defaulting to 1 core
@xiaoyao1991 I wonder how this would affect adoption and learning how to use the system. As a new user of the system, I would like to be able to run a simple topology without adding much configuration, to understand how things work, without needing to provision 6 cores for a single topology. How would this affect running topologies locally?
I also wonder how this would affect current users of the system who haven't explicitly defined resource constraints for the components in a topology. Would this be considered a breaking change?
@joshfischer1108 That is true. This puts more burden on users to know how to configure things properly, which is bad. Over the past few months, I've been trying to modify the round-robin packing algorithm. In previous releases (not the latest one), the algorithm basically ignored container-level constraints whenever the total resources required by the components exceeded the container resource, which might sound flexible but can be risky if the scheduler (say, Aurora) rejects the packing plan due to an insufficient resource quota.
I've made modifications so that container-level constraints must be honored when set, but I've seen users complaining that their previously working topologies suddenly fail to submit... I'm trying to find a balance here.
What do you think about reducing the default instance CPU to a smaller number, say 0.25?
@xiaoyao1991 If it were me, I would like to see a topology fail to submit with an explicit error message detailing the reason for the submission failure. That seems better than the alternative of launching a topology that ignores the container resource limits. 0.25 is definitely more reasonable than 1 core per component, but it still may be a bit high for a default. I remember when I was first getting into Heron, I spun up a topology that had 6 different components, each with a different parallelism, to show off the capabilities of the system. What do you think about 0.10 per component?
I guess the alternative to lowering it below 0.25 is to find a way to alert the user that the topology's configured resources exceed what is available.
@joshfischer1108 Warning and failing the submission is exactly how it behaves currently (though the warning message should probably include more debugging information about why the submission failed). But I got a lot of complaints from customers saying their jobs failed to submit because the previous version was too loose on resource constraints... We are essentially moving from an extremely loose packing policy to a relatively tight one. I'm good with 0.10 per component. @nwangtw thoughts?
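If we go with a smaller default, users who need something different could also override it per cluster. A sketch of what that override might look like, assuming this Heron version reads the instance defaults from the `heron.resources.instance.*` keys (check the `Key` definitions in your release before relying on these):

```yaml
# Hypothetical override in the cluster config (e.g. conf/<cluster>/packing.yaml);
# the key names assume heron.resources.instance.* is where instance defaults live.
heron.resources.instance.cpu: 0.1          # default CPU per instance
heron.resources.instance.ram: 1073741824   # default RAM per instance, in bytes (1 GB)
```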
Just wanted to revisit the merge. Can we merge this? Has this functionality been added to the default RoundRobinPacking algorithm? Should this implementation have a different name? Or is it truly a V2 of the previous RoundRobinPacking class?
I also like the idea of 0.1 per component.
I think supporting partial CPU can be useful for k8s clusters and local hosts, so I am all for it. It could be tricky to test all the cases, but we can solve them as they come up, I guess.
@xiaoyao1991 should we merge this PR or drop it?
It's a standalone class. This merge shouldn't impact any of the existing functionality. I've decided to merge it so that the code is present in the main branch. If anyone wants to use or test it, they can update their Heron configs to reference the newer Packing class.
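For anyone who wants to try it, the packing algorithm is selected in the scheduler's packing.yaml. A sketch of the config change, with the caveat that the class name below is a placeholder, not a name confirmed by this PR:

```yaml
# conf/<cluster>/packing.yaml
# "RoundRobinPackingV2" is a hypothetical placeholder; substitute the actual
# class name this PR merged under.
heron.class.packing.algorithm: org.apache.heron.packing.roundrobin.RoundRobinPackingV2
```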