service-fabric
[BUG] GetReplicas API method does not work with request drain
Describe the bug Service Fabric implements a request drain feature, meant to "Avoid connection drops during stateless service planned downtime" (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-advanced#avoid-connection-drops-during-stateless-service-planned-downtime). It works by no longer advertising the endpoint for a configured period before the instance is closed, e.g. for a planned application update.
The GetReplicas method (https://docs.microsoft.com/en-us/rest/api/servicefabric/sfclient-api-getreplicainfolist) can be used to get the endpoint data, but it still returns the endpoint during the request drain period, when the endpoint should no longer be advertised.
This means the endpoint keeps receiving full traffic until it goes down, and requests fail for a moment after that, instead of the endpoint being drained of requests before it is removed.
Area/Component: Service Fabric Client API
To Reproduce Steps to reproduce the behavior:
- Set up the request drain feature for a container application (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-advanced#avoid-connection-drops-during-stateless-service-planned-downtime, e.g. InstanceCloseDelayDurationSeconds="60")
- Start an application update
- While the instance close delay duration is ongoing, query the replicas through the Service Fabric Client API, e.g. https://sfclustername.location.cloudapp.azure.com:19080/Partitions/12345678-4b95-4298-b8de-af5e4abe03f9/$/GetReplicas?api-version=6.0
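To make the last step easier to check, here is a minimal Go sketch that extracts the advertised endpoint URLs from a GetReplicas response body. It assumes the response has an `Items` array whose entries carry an `Address` field holding a JSON document encoded as a string, with an `Endpoints` map inside; that matches what I see from the API, but treat the struct definitions as assumptions. The sample payload and IP address are made up, and `advertisedEndpoints` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// replicaList mirrors only the parts of a GetReplicas response used here
// (assumed shape, not a complete model of the API).
type replicaList struct {
	Items []replicaItem `json:"Items"`
}

type replicaItem struct {
	ReplicaStatus string `json:"ReplicaStatus"`
	Address       string `json:"Address"` // JSON document encoded as a string
}

type replicaAddress struct {
	Endpoints map[string]string `json:"Endpoints"`
}

// advertisedEndpoints returns the sorted endpoint URLs found in a
// GetReplicas response body. Items with an empty Address are skipped.
func advertisedEndpoints(body []byte) ([]string, error) {
	var list replicaList
	if err := json.Unmarshal(body, &list); err != nil {
		return nil, err
	}
	var out []string
	for _, item := range list.Items {
		if item.Address == "" {
			continue
		}
		var addr replicaAddress
		if err := json.Unmarshal([]byte(item.Address), &addr); err != nil {
			return nil, err
		}
		for _, url := range addr.Endpoints {
			out = append(out, url)
		}
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	// Made-up sample payload, for illustration only.
	sample := []byte(`{"Items":[{"ReplicaStatus":"Ready","Address":"{\"Endpoints\":{\"\":\"http://10.0.0.4:8080\"}}"}]}`)
	eps, err := advertisedEndpoints(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(eps)
}
```

Running this against the real response every few seconds during the close delay should show whether the endpoint of the closing instance is still listed.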
Expected behavior When an instance is about to go down in a planned manner, its endpoint is removed from the GetReplicas results as soon as the instance close delay duration starts.
Observed behavior: GetReplicas keeps returning the endpoint throughout the instance close delay duration and only removes it after the instance goes down.
Service Fabric Runtime Version: 8.1.321.9590
Environment:
- Azure
- OS: Windows 2019
- Version 8.1.321.9590
Does not seem to be a regression.
Additional context The GetReplicas method is used by the Traefik plugin to get the current endpoint status. The plugin uses that information to add or remove the endpoints it directs traffic to. With request drain working as intended, an endpoint would stop being advertised before it is fully removed, giving Traefik time to stop directing traffic there. Without it, there is a window in which Traefik keeps sending requests to the removed endpoint, until either Traefik's health check notices that the endpoint has gone down or GetReplicas stops returning it.
The ResolvePartition method would give the updated endpoint information during request drain, but it returns stale data unless it is given a parameter that changes on every call: https://github.com/Microsoft/service-fabric/issues/80
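To illustrate why the timing of the removal matters, here is a rough Go sketch of how a poller can turn two consecutive endpoint lists into add/remove decisions. This is not the actual Traefik plugin code; `diffEndpoints` and the addresses are hypothetical and only show the general idea that an endpoint is taken out of rotation on the first poll where it is missing.

```go
package main

import (
	"fmt"
	"sort"
)

// diffEndpoints compares the endpoint lists from two consecutive polls and
// reports which endpoints appeared and which disappeared.
func diffEndpoints(prev, curr []string) (added, removed []string) {
	prevSet := make(map[string]bool, len(prev))
	for _, e := range prev {
		prevSet[e] = true
	}
	currSet := make(map[string]bool, len(curr))
	for _, e := range curr {
		if currSet[e] {
			continue // ignore duplicates within one poll
		}
		currSet[e] = true
		if !prevSet[e] {
			added = append(added, e)
		}
	}
	for e := range prevSet {
		if !currSet[e] {
			removed = append(removed, e)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	// Made-up addresses: the second instance is missing from the later poll.
	prev := []string{"http://10.0.0.4:8080", "http://10.0.0.5:8080"}
	curr := []string{"http://10.0.0.4:8080"}
	added, removed := diffEndpoints(prev, curr)
	fmt.Println("added:", added, "removed:", removed)
}
```

If GetReplicas dropped the endpoint at the start of the close delay, the "removed" result would arrive while the instance can still finish in-flight requests; today it only arrives after the instance is already down.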
Relevant Traefik code (see "getInstances"):
- Current Traefik 1.7 plugin: https://github.com/containous/traefik-extra-service-fabric/blob/master/servicefabric.go
- Not yet released Traefik 2.x plugin: https://github.com/dariopb/traefikServiceFabricPlugin/blob/c34874d82f063fd78979da38dcbe49967a9c9788/serviceFabricPlugin.go
- Calling Service Fabric API: https://github.com/jjcollinge/servicefabric/blob/8eebe170fa1ba25d3dfb928b3f86a7313b13b9fe/servicefabric.go
Neither containers nor Traefik can use the callback function mentioned in the request drain documentation, since they are not built with the Azure Service Fabric libraries.
Assignees: /cc @microsoft/service-fabric-triage