deploykit
deploykit copied to clipboard
Group controller always calling instance plugin with "properties=true"
The group SPI has a DescribeGroup
:
https://github.com/docker/infrakit/blob/master/pkg/spi/group/spi.go#L45
The group controller then called the instance plugin's DescribeInstances
: https://github.com/docker/infrakit/blob/master/pkg/plugin/group/scaled.go#L142
Note that the DescribeInstances
is hardcoded to properties=true
. Depending on how the instance plugin is implemented, this can cause a lot of processing. When the group controller does it's normal processing (to determine the group is of the appropriate size) these additional attributes are not even used. Therefore, most of the group <-> instance communications is asking for additional properties that are never used.
Note that when the enrollment and ingress controllers retrieve the group members, we do need the additional properties.
Can we somehow update the SPI to expose properties
boolean? This would reduce the processing done in the instance plugin.
Note, for the terraform provider, every DescribeInstances
API with properties=true
results in a Command.Exec
to terraform show
; my investigations show that this does not scale. Our deployment has 3 groups, 1 enrollment controller, and 1 ingress controller (managing 3 L4 load balancers) -- this results in dozens of DescribeInstances
invocations per minute. FWIW, I'm working on caching the instances to help mitigate this issue; however, IMO, the SPI should expose the properties
boolean.
Thoughts on what should be done to the SPI?
How about this..
We can create a wrapper instance plugin that implements the DescribeInstance
method by returning cached result up a TTL. If the TTL expires then it does a real query by delegating to the wrapped instance plugin's DescribeInstances
. We'd always query with properties=true
. Go can support this type of mixin behavior via composition / embedding pretty easily and we would only have to implement one function. This could be a quick fix I'd try to see if that improve things.
@chungers Sorry, I don't understand how this will help reduce the load that is incurred by including properties=true
?
Are you suggesting that we implement a cache layer for every instance plugin? If so, we need to be careful about how'd we invalidate that cache. For example, a Describeinstances
right after a Provision
will need to return that newly provisioned instance -- same requirement for any Label
or Destroy
operation.
Just realized we'd have to invalidate the cached value whenever state changes. This means we'd have to implement the mutation methods like provision and destroy to forward to the delegated real plugin. When the delegate returns successfully, we need to invalidate the cache result and let the next DescribeInstances do the real query. What do you think?
@chungers Yes, that's what I'm thinking. But we also need to think about filtering. The describe API accepts a tag
map that it uses to filter. If we do this solution wouldn't we want to cache "all" instances in the wrapper, and then apply the tag filter to the cached instances? If so, we need to think about how the instance plugin would implement a describe without a tag -- we don't want to return all instances in the cloud provider account. Maybe we'd use something like the swarm ID since that'd be unique and apply to all groups?