components-contrib icon indicating copy to clipboard operation
components-contrib copied to clipboard

Azure EventGrid bindings conformance test fails in ~15% of runs

Open CodeMonkeyLeet opened this issue 3 years ago • 6 comments

Expected Behavior

Azure EventGrid bindings conformance test should pass reliably.

Actual Behavior

The EventGrid bindings tests continue to fail about 15% of the time in automation, which is the highest of any conformance test.

=== RUN   TestBindingsConformance/azure.eventgrid/verify_read
time="2021-11-05T19:05:57.918859725Z" level=info msg="Verify Read test running ..." instance=fv-az209-699 scope=bindingsTest type=log ver=unknown
    assertion_compare.go:313: 
        	Error Trace:	bindings.go:233
        	Error:      	"0" is not greater than "0"
        	Test:       	TestBindingsConformance/azure.eventgrid/verify_read
        	Messages:   	[]
time="2021-11-05T19:09:57.919209065Z" level=info msg="Read timeout." instance=fv-az209-699 scope=bindingsTest type=log ver=unknown

This is a continuation of failure signature seen in https://github.com/dapr/components-contrib/issues/1003, where the verify_read test is failing because it times out waiting on the test Read() handler to be invoked for message published to EventGrid.

  • The attempted fix https://github.com/dapr/components-contrib/pull/1050 does not address the failure completely.
    • The continuing failures show that all invocations to publish happen seconds after the http server for Read() has been started.
    • Decreasing the wait to < 1s increases the rate of failure as publishing before read is ready means that the delivery fails.
    • Increasing the wait beyond the current 8s offers no observable failure rate reduction, indicating that the continuing failures are unlikely due to sequencing of test actions.

It looks like the failure to receive the published test message may be due to its interaction with ngrok:

  • Per Dapr docs on local testing of EventGrid bindings, the EventGrid conformance test uses ngrok to provide public HTTP ingress to the local server created by the input binding. The additional interaction may be involved in a race.
    • Some ngrok logs from the failures includes connection warnings, but nothing conclusive.
    • It's unusual that most passes have two connection joins from ngrok to the local server and failures never do, but it's not a requirement for successful runs either. Given that the test only runs a single subscribe, it's also not clear what the two connections would be for.
  • The failure is not in Invoke() publishing to EventGrid: looking at the history of the EventGrid service shows a tight correlation between logged failed message deliveries from EventGrid and the EventGrid conformance test instances. (i.e. EventGrid received the test message, but could not deliver it to the subscription)
  • EventGrid does have a default retry policy but there's no indication that any retries have been attempted. This may be due to EventGrid receiving a 400 range status from the webhook, which could be due to the ngrok endpoint not being joined to the component when the delivery is attempted.

Steps to Reproduce the Problem

Run the conformance.yml GitHub workflow repeatedly.

Release Note

RELEASE NOTE: FIX Azure EventGrid bindings conformance test flakiness

CodeMonkeyLeet avatar Nov 05 '21 21:11 CodeMonkeyLeet

Event Grid diagnostic logs on failed deliveries confirms that Event Grid is getting 404 (not found) from the ngrok endpoint in the failure cases, hence no retries from Event Grid. For example:

{
    "time": "2021-11-10T21:05:57.3587525Z",
    "resourceId": "/SUBSCRIPTIONS/38875A89-0178-4F27-A141-DC6FC01F183D/RESOURCEGROUPS/DAPR2-CONF-TEST-RG/PROVIDERS/MICROSOFT.EVENTGRID/TOPICS/DAPR2-CONF-TEST-EVENTGRID-TOPIC",
    "eventSubscriptionName": "CONF-TEST-SUBSCRIPTION",
    "category": "DeliveryFailures",
    "operationName": "Deliver",
    "message": "outcome=NotFound, latencyInMs=324, id=04fcab9d-fd31-4fec-a0cd-d6b03b139799, outputEventSystemId=762da8f3-5be2-40e8-86dd-d2ba2b9dde25, state=FilteredFailingDelivery, deliveryTime=11/10/2021 9:05:57 PM, deliveryCount=0, probationCount=0, deliverySchema=CloudEventV10, eventSubscriptionDeliverySchema=CloudEventV10, outputEventFields=InputEvent| EventSubscriptionId| DeliveryTime| State| Id| LastHttpStatusCode| DeliverySchema| LastDeliveryAttemptTime| TransformedDeliverySchema| SystemId, outputEventFieldCount=, requestExpiration=1/1/0001 12:00:00 AM, delivered=False id=A234-1234-1234, inputEventSystemId=4d7cae23-3025-44b6-ba28-aa02ecbd57c5 publishTime=11/10/2021 9:05:57 PM, eventTime=1/25/2018 10:12:19 PM, eventType=test, deliveryTime=11/10/2021 9:05:57 PM, filteringState=FilteredWithRpc, inputSchema=EventGridEvent, publisher=DAPR2-CONF-TEST-EVENTGRID-TOPIC.WESTUS-1.EVENTGRID.AZURE.NET, size=322, subject=dapr-conf-tests, inputEventFields=Id| PublishTime| SerializedBody| EventType| Topic| Subject| FilteringHashCode| SystemId| Publisher| FilteringTopic| TopicCategory| DataVersion| MetadataVersion| InputSchema| EventTime| FilteringPolicy, inputEventFieldCount=, type=HttpPost, subType=Others, supportsBatching=False, aadIntegration=False, managedIdentityType=None, urlPath=https://d1e7-20-114-37-175.ngrok.io/api/events, deliveryResponse=NotFound: HttpRequestMessage: httpVersion=1.1, HttpResponseMessage: HttpVersion=1.1, StatusCode=NotFound(NotFound), StatusDescription=Not Found, ConnectionInfo=defaultConnectionLimit=8192, reusePortSupported=True, reusePort=True: "
}

While ngrok has seen outages like this, the pattern of failures seen in the conformance test history is not correlated with the reported ngrok outages.

CodeMonkeyLeet avatar Nov 10 '21 22:11 CodeMonkeyLeet

The additional logging confirms that the EventGrid service is somehow sending to the message to the wrong URI:

From a failed delivery event log:

{
  "time": "2021-11-19T23:46:46.6945198Z",
  "resourceId": "/SUBSCRIPTIONS/38875A89-0178-4F27-A141-DC6FC01F183D/RESOURCEGROUPS/DAPR2-CONF-TEST-RG/PROVIDERS/MICROSOFT.EVENTGRID/TOPICS/DAPR2-CONF-TEST-EVENTGRID-TOPIC",
  "eventSubscriptionName": "CONF-TEST-SUBSCRIPTION",
  "category": "DeliveryFailures",
  "operationName": "Deliver",
  "message": "outcome=NotFound, latencyInMs=320, id=46380374-1bde-4132-b81b-d66bd7c7148f, outputEventSystemId=a5e92ca0-b8d1-4b2e-b5e5-9b4c89a3f8e4, state=FilteredFailingDelivery, deliveryTime=11/19/2021 11:46:46 PM, deliveryCount=0, probationCount=0, deliverySchema=CloudEventV10, eventSubscriptionDeliverySchema=CloudEventV10, outputEventFields=InputEvent| EventSubscriptionId| DeliveryTime| State| Id| LastHttpStatusCode| DeliverySchema| LastDeliveryAttemptTime| TransformedDeliverySchema| SystemId, outputEventFieldCount=, requestExpiration=1/1/0001 12:00:00 AM, delivered=False id=A234-1234-1234, inputEventSystemId=327a3ba0-f5a9-4bca-89d4-53fdb69b7cb2 publishTime=11/19/2021 11:46:46 PM, eventTime=1/25/2018 10:12:19 PM, eventType=test, deliveryTime=11/19/2021 11:46:46 PM, filteringState=FilteredWithRpc, inputSchema=EventGridEvent, publisher=DAPR2-CONF-TEST-EVENTGRID-TOPIC.WESTUS-1.EVENTGRID.AZURE.NET, size=322, subject=dapr-conf-tests, inputEventFields=Id| PublishTime| SerializedBody| EventType| Topic| Subject| FilteringHashCode| SystemId| Publisher| FilteringTopic| TopicCategory| DataVersion| MetadataVersion| InputSchema| EventTime| FilteringPolicy, inputEventFieldCount=, type=HttpPost, subType=Others, supportsBatching=False, aadIntegration=False, managedIdentityType=None, urlPath=https://565e-20-124-206-162.ngrok.io/api/events, deliveryResponse=NotFound: HttpRequestMessage: httpVersion=1.1, HttpResponseMessage: HttpVersion=1.1, StatusCode=NotFound(NotFound), StatusDescription=Not Found, ConnectionInfo=defaultConnectionLimit=8192, reusePortSupported=True, reusePort=True: "
}

The urlPath is https://565e-20-124-206-162.ngrok.io/api/events, however, the test expected that it created the subscription with https://2259-20-122-43-178.ngrok.io/api/events:

time="2021-11-19T23:46:38.06095096Z" level=debug msg="Attempting to create or update Event Grid subscription. scope=*** endpointURL=https://2259-20-122-43-178.ngrok.io/api/events" instance=fv-az133-304 scope=testLogger type=log ver=unknown
time="2021-11-19T23:46:39.608373861Z" level=debug msg="Succeeded to create or update Event Grid subscription. scope=*** endpointURL=https://2259-20-122-43-178.ngrok.io/api/events" instance=fv-az133-304 scope=testLogger type=log ver=unknown
time="2021-11-19T23:46:39.608427161Z" level=debug msg="About to start listening for Event Grid events at http://localhost:9000/api/events" instance=fv-az133-304 scope=testLogger type=log ver=unknown
=== RUN   TestBindingsConformance/azure.eventgrid/create
time="2021-11-19T23:46:46.061614595Z" level=info msg="Create test running ..." instance=fv-az133-304 scope=bindingsTest type=log ver=unknown
time="2021-11-19T23:46:46.399376403Z" level=debug msg="Successfully posted event to ***" instance=fv-az133-304 scope=testLogger type=log ver=unknown

The extended ngrok logs confirm that https://2259-20-122-43-178.ngrok.io/api/events is the established tunnel to the local instance, it successfully processed the OPTIONS call to that endpoint, and that the local instance did not receive any POST messages from EventGrid.

It turns out that the incorrect URL comes from the previous run of the Event Grid test half an hour prior. This suggests that the event grid binding returns success when failing to overwrite an existing subscription, which is unexpected.

CodeMonkeyLeet avatar Nov 20 '21 02:11 CodeMonkeyLeet

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot avatar Dec 20 '21 02:12 dapr-bot

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

dapr-bot avatar Dec 27 '21 02:12 dapr-bot

This will be needed before we are able to work towards making Event Grid as stable. Will have a look into this issue.

DeepanshuA avatar Jul 21 '22 08:07 DeepanshuA

/assign

DeepanshuA avatar Jul 21 '22 08:07 DeepanshuA

I tried to look into almost every failure in last 2 weeks Components Conformance Tests from https://github.com/dapr/components-contrib/actions/workflows/conformance.yml and then, various random failures in last 4 months runs - but, couldn't find failure corresponding to [bindings.azure.eventgrid conformance.

I believe in such case, it should be fine to close this issue.

DeepanshuA avatar Dec 26 '22 14:12 DeepanshuA

Agreed @DeepanshuA. This issue is no longer relevant.

berndverst avatar Jan 18 '23 21:01 berndverst