deploykit Group controller always calling instance plugin with "properties=true"

The group SPI has a DescribeGroup: https://github.com/docker/infrakit/blob/master/pkg/spi/group/spi.go#L45

The group controller then called the instance plugin's DescribeInstances: https://github.com/docker/infrakit/blob/master/pkg/plugin/group/scaled.go#L142

Note that the DescribeInstances is hardcoded to properties=true. Depending on how the instance plugin is implemented, this can cause a lot of processing. When the group controller does it's normal processing (to determine the group is of the appropriate size) these additional attributes are not even used. Therefore, most of the group <-> instance communications is asking for additional properties that are never used.

Note that when the enrollment and ingress controllers retrieve the group members, we do need the additional properties.

Can we somehow update the SPI to expose properties boolean? This would reduce the processing done in the instance plugin.

Note, for the terraform provider, every DescribeInstances API with properties=true results in a Command.Exec to terraform show; my investigations show that this does not scale. Our deployment has 3 groups, 1 enrollment controller, and 1 ingress controller (managing 3 L4 load balancers) -- this results in dozens of DescribeInstances invocations per minute. FWIW, I'm working on caching the instances to help mitigate this issue; however, IMO, the SPI should expose the properties boolean.

Thoughts on what should be done to the SPI?

Jan 09 '18 21:01 kaufers

How about this..

We can create a wrapper instance plugin that implements the DescribeInstance method by returning cached result up a TTL. If the TTL expires then it does a real query by delegating to the wrapped instance plugin's DescribeInstances. We'd always query with properties=true. Go can support this type of mixin behavior via composition / embedding pretty easily and we would only have to implement one function. This could be a quick fix I'd try to see if that improve things.

Jan 10 '18 16:01 chungers

@chungers Sorry, I don't understand how this will help reduce the load that is incurred by including properties=true?

Are you suggesting that we implement a cache layer for every instance plugin? If so, we need to be careful about how'd we invalidate that cache. For example, a Describeinstances right after a Provision will need to return that newly provisioned instance -- same requirement for any Label or Destroy operation.

Jan 10 '18 17:01 kaufers

Just realized we'd have to invalidate the cached value whenever state changes. This means we'd have to implement the mutation methods like provision and destroy to forward to the delegated real plugin. When the delegate returns successfully, we need to invalidate the cache result and let the next DescribeInstances do the real query. What do you think?

Jan 10 '18 17:01 chungers

@chungers Yes, that's what I'm thinking. But we also need to think about filtering. The describe API accepts a tag map that it uses to filter. If we do this solution wouldn't we want to cache "all" instances in the wrapper, and then apply the tag filter to the cached instances? If so, we need to think about how the instance plugin would implement a describe without a tag -- we don't want to return all instances in the cloud provider account. Maybe we'd use something like the swarm ID since that'd be unique and apply to all groups?

Jan 10 '18 18:01 kaufers

deploykit deploykit copied to clipboard

Group controller always calling instance plugin with "properties=true"

deploykit
deploykit copied to clipboard