
Azure DNS Lookup Failures

Open joe-elliott opened this issue 2 years ago • 20 comments

Describe the bug

In Azure, all components can register DNS errors while trying to work with meta.compacted.json. This issue does not occur in any other environment. Unsure if this is an Azure issue or a Tempo issue. The error itself is a failed TCP connection to a DNS server, which suggests this is some issue with Azure infrastructure. However, the fact that the error (almost?) always occurs on meta.compacted.json suggests something about the way we handle that file is different and is causing this issue.

The failures look like:

reading storage container: Head "https://tempoe**************.blob.core.windows.net/tempo/single-tenant/d8aafc48-5796-4221-ac0b-58e001d18515/meta.compacted.json?timeout=61": dial tcp: lookup tempoe**************.blob.core.windows.net on 10.0.0.10:53: dial udp 10.0.0.10:53: operation was canceled

or

error deleting blob, name: single-tenant/*******************/data: Delete "https://tempoe******.blob.core.windows.net/tempo/single-tenant/5b1ab746-fee7-409c-944d-1c1d5ba7a70e/data?timeout=61": dial tcp: lookup tempoe******.blob.core.windows.net on 10.0.0.10:53: dial udp 10.0.0.10:53: operation was canceled

We have seen this issue internally. Also reported here: #1372.

joe-elliott avatar Jun 01 '22 13:06 joe-elliott

Not sure if it is related, but we also see TempoTenantIndexFailures and TempoTenantIndexTooOld errors; they may be caused by this.

hwu-talendbj avatar Jun 06 '22 03:06 hwu-talendbj

Yes, both of those can be triggered by this issue. In Azure we would also recommend setting the following value:

storage:
  trace:
    blocklist_poll_tenant_index_builders: 1

This will reduce errors related to tenant index building. Azure does not like it when multiple processes attempt to write to the same object.

joe-elliott avatar Jun 06 '22 13:06 joe-elliott

@joe-elliott thanks, I am trying it

hwu-talendbj avatar Jun 07 '22 07:06 hwu-talendbj

@joe-elliott Hi, I just noticed that we are running only 1 Tempo compactor, so blocklist_poll_tenant_index_builders: 1 may not help in this case. Or do you think it is better to go with multiple compactors?

hwu-talendbj avatar Jun 07 '22 09:06 hwu-talendbj

Compactors simply reduce the length of the blocklist and help keep query latencies down. If your latencies are fine then your compactors are fine. Two metrics that can help you keep an eye on your blocklist: avg(tempodb_blocklist_length{}) by (tenant) and sum(tempodb_compaction_outstanding_blocks{}) by (tenant)
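
For example, a minimal Prometheus alerting rule built on the second query; the rule name, threshold, and duration here are just placeholders to tune for your own cluster:

groups:
  - name: tempo-blocklist
    rules:
      # placeholder threshold: fire when compaction stays behind for a sustained period
      - alert: TempoOutstandingBlocksHigh
        expr: sum(tempodb_compaction_outstanding_blocks{}) by (tenant) > 100
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Compaction is falling behind for tenant {{ $labels.tenant }}"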

joe-elliott avatar Jun 07 '22 12:06 joe-elliott

Inside the compactor pod, I tested an nslookup of the Azure blob host, and it looks good:

nslookup tempoe4c***********2b.blob.core.windows.net
Server:		10.0.0.10
Address:	10.0.0.10:53

Non-authoritative answer:
tempoe4c1**********b.blob.core.windows.net	canonical name = blob.ams07prdstr02a.store.core.windows.net
Name:	blob.ams07prdstr02a.store.core.windows.net
Address: 20.150.37.228

Non-authoritative answer:
tempoe4c**********b.blob.core.windows.net	canonical name = blob.ams07prdstr02a.store.core.windows.net

How can I enable debug mode for Tempo so I can get more logs?

hwu-talendbj avatar Jun 15 '22 07:06 hwu-talendbj

You can use the following configuration to enable debug logging on Tempo

server:
    log_level: debug

annanay25 avatar Jun 15 '22 07:06 annanay25

Hello :) After enabling debug mode with @hwu-talendbj, we found more data. This is what happens during push/pull:

{"caller":"memberlist_logger.go:74","level":"error","msg":"Push/Pull with grafana-tempo-distributor-66fbfb5d99-ms794-ad6112e9 failed: dial tcp 10.128.18.112:7946: i/o timeout","ts":"2022-06-21T12:46:32.111904425Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:32.452547619Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.20.104:42880","ts":"2022-06-21T12:46:33.866113092Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.16.18:51284","ts":"2022-06-21T12:46:36.210169812Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.38042ms","traceID":"5ce5bffbf7a76751","ts":"2022-06-21T12:46:38.496729396Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.50652ms","traceID":"4f0e40a70e0652d9","ts":"2022-06-21T12:46:38.510874078Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:39.451709386Z"}
{"caller":"memberlist_logger.go:74","level":"info","msg":"Suspect grafana-tempo-distributor-66fbfb5d99-ms794-ad6112e9 has failed, no acks received","ts":"2022-06-21T12:46:39.454869705Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.17.62:55936","ts":"2022-06-21T12:46:39.986359017Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:40.182091363Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: grafana-tempo-distributor-66fbfb5d99-j67d2-a0aa99f0 (timeout reached)","ts":"2022-06-21T12:46:41.455734719Z"}
{"caller":"compactor.go:64","level":"info","msg":"starting compaction cycle","offset":0,"tenantID":"single-tenant","ts":"2022-06-21T12:46:51.476815384Z"}
{"caller":"compactor.go:70","level":"info","msg":"compaction cycle complete. No more blocks to compact","tenantID":"single-tenant","ts":"2022-06-21T12:46:51.476889284Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:52.451737191Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: grafana-tempo-distributor-66fbfb5d99-j67d2-a0aa99f0 (timeout reached)","ts":"2022-06-21T12:46:56.453139815Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.18.158:46552","ts":"2022-06-21T12:46:57.18343619Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:57.396034335Z"}
{"caller":"broadcast.go:48","content":"[grafana-tempo-ingester-0]","key":"collectors/ring","level":"debug","msg":"Invalidating forwarded broadcast","oldContent":"[grafana-tempo-ingester-0]","oldVersion":6660,"ts":"2022-06-21T12:46:57.397407443Z","version":6661}

But it should look like:

{"caller":"memberlist_logger.go:74","level":"debug","msg":"Initiating push/pull sync with: grafana-tempo-ingester-0-435a9d15 10.128.18.158:7946","ts":"2022-06-21T12:48:02.11782556Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.511421ms","traceID":"4aa6e02d1b78e775","ts":"2022-06-21T12:48:08.497212902Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.194218ms","traceID":"6ab9c753aed0029e","ts":"2022-06-21T12:48:08.510489979Z"}
{"caller":"retention.go:46","level":"debug","msg":"Performing block retention","retention":"48h0m0s","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476575596Z"}
{"caller":"compactor.go:192","hash":"1b8c524d-a934-451b-a9ca-4ea7db8e3b7e","level":"debug","msg":"checking hash","ts":"2022-06-21T12:48:21.476661197Z"}
{"caller":"compactor.go:211","level":"debug","msg":"checking addresses","owning_addr":"10.128.17.60:0","this_addr":"10.128.17.60:0","ts":"2022-06-21T12:48:21.476687197Z"}
{"blockID":"1b8c524d-a934-451b-a9ca-4ea7db8e3b7e","caller":"retention.go:76","level":"info","msg":"deleting block","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476701097Z"}
{"caller":"compactor.go:64","level":"info","msg":"starting compaction cycle","offset":0,"tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476772798Z"}
{"caller":"compactor.go:70","level":"info","msg":"compaction cycle complete. No more blocks to compact","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476833798Z"}

Are you sure this is related to Azure DNS? :(

galvesLoy avatar Jun 21 '22 13:06 galvesLoy

@joe-elliott Also experiencing the same error when running in Azure (AKS and Azure Blob Storage). I tried running with the config below for two days, but am still getting the same error.

storage:
  trace:
    blocklist_poll_tenant_index_builders: 1

We are using a private endpoint in Azure to connect to the storage account. Is anyone else using a private endpoint to connect to the storage account?

martinjaegerdk avatar Jun 22 '22 08:06 martinjaegerdk

@galvesLoy

Are you sure this is related to Azure DNS ? :(

If you're referring to the memberlist issues in your logs, failures during rollout are normal and should be ignored. Constant memberlist failures suggest networking issues in your cluster. (https://github.com/grafana/tempo/issues/927)

We run multiple clusters on AWS, GCP and Azure. Azure is the only cloud provider on which we see this issue, and we see it on all Azure clusters. It almost always occurs when performing a HEAD request on a meta.compacted.json, but rarely it also occurs on a GET request to meta.json. It is likely related to access patterns in Tempo b/c AFAIK other teams internally at Grafana are not seeing this issue.

We have only seen this issue occurring in compactors and only when polling the blocklist. Perhaps something about the rapid fire requests to Azure blob storage during polling is causing an issue?

@martin-jaeger-maersk

That suggested configuration prevents a different error unique to ABS. ABS sometimes throws a strange error when two processes attempt to write the same object at the same time. We were receiving the error here until we reduced blocklist_poll_tenant_index_builders to 1. It is 2 by default.

Following up on the thought that perhaps our default settings are hitting ABS too hard, I'm going to experiment with lowering storage.trace.blocklist_poll_concurrency to see if the issue goes away.

joe-elliott avatar Jun 22 '22 18:06 joe-elliott

In a cluster with ~250 blocks I changed storage.trace.blocklist_poll_concurrency from the default of 50 to 10, but we continued to see the issue. Then I lowered storage.trace.blocklist_poll_concurrency to 1 and the issue went away. This technically "fixes" the issue but is likely not useful for larger clusters where the time to complete a polling cycle would go up significantly. I do find it quite odd that the client appears to be making a DNS request every time we access an object, but maybe this is an intentional choice on Azure's part.
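
For reference, the setting I'm experimenting with (shown here with the value of 1 that made the errors stop in this cluster; larger clusters will likely need a higher value):

storage:
  trace:
    # default is 50; lower values mean fewer concurrent requests (and DNS lookups) during each poll cycle
    blocklist_poll_concurrency: 1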

I'm going to spend some time trying to upgrade our client. There appear to be 2 official clients and it's unclear what the relationship is between them.

joe-elliott avatar Jun 24 '22 13:06 joe-elliott

I do find it quite odd that the client appears to be making a DNS request every time we access an object, but maybe this is an intentional choice on Azure's part.

I agree, that does sound very odd. It could be the Azure Go client doing it intentionally, or simply a very low TTL on the DNS entry.

Where in the Tempo code is the Azure client called (in this particular case)?

martinjaegerdk avatar Jun 24 '22 19:06 martinjaegerdk

Here's where we begin polling the blocklist concurrently:

https://github.com/grafana/tempo/blob/d8d4dc86ac30f87b88718d452003ec7061a88a52/tempodb/blocklist/poller.go#L186

The actual poll function requests the block meta and, if that can't be found, the compacted block meta:

https://github.com/grafana/tempo/blob/d8d4dc86ac30f87b88718d452003ec7061a88a52/tempodb/blocklist/poller.go#L234

Both BlockMeta() and CompactedBlockMeta() eventually hit this method on the Azure backend:

https://github.com/grafana/tempo/blob/d8d4dc86ac30f87b88718d452003ec7061a88a52/tempodb/backend/azure/azure.go#L144

joe-elliott avatar Jun 24 '22 19:06 joe-elliott

@joe-elliott A few more points that could make sense, since this is only seen in Azure:

DNS request throttle

https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling

Azure DNS and Azure Private DNS have a throttle limit of 500 read (GET) operations per 5 minutes.

https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-name-resolution-for-vms-and-role-instances#considerations

DNS query traffic is throttled for each VM. Throttling shouldn't impact most applications. If request throttling is observed, ensure that client-side caching is enabled. For more information, see DNS client configuration.

DNS race condition when using UDP

Not Azure-specific per se, but it can be a result of the OS and patch level of the VMs used in AKS.

https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69

Found this article, which talks about all the things that can go wrong when hammering DNS over UDP. One suggestion the author brings up is to force the container to use TCP instead of UDP for DNS. This adds more overhead, but could result in connections actually being established.

The thesis is that a high volume of DNS requests over UDP can cause a race condition in the kernel, which can result in UDP requests failing.
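
If anyone wants to experiment with forcing TCP, one pod-level sketch is the glibc use-vc resolver option via dnsConfig (this assumes a glibc-based image; musl-based images such as Alpine ignore this option):

spec:
  dnsConfig:
    options:
      # glibc resolv.conf option that forces DNS queries over TCP instead of UDP
      - name: use-vc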

martinjaegerdk avatar Jun 24 '22 19:06 martinjaegerdk

Using the newly added config in the linked PR and setting the blocklist_poll_jitter_ms value to 500, I was able to go 24 hours without seeing this issue.

#1518
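
For anyone who wants to try it before it lands in a release, the setting looks roughly like this (assuming it sits under storage.trace like the other blocklist_poll_* options):

storage:
  trace:
    # adds up to this many ms of random jitter to blocklist polling requests, spreading out DNS lookups
    blocklist_poll_jitter_ms: 500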

joe-elliott avatar Jun 28 '22 12:06 joe-elliott

@joe-elliott That's very interesting 👍 Does that mean that after 24 hours you saw the error, or simply that you have tested for 24 hours without seeing the error?

martinjaegerdk avatar Jun 28 '22 12:06 martinjaegerdk

I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 - 20 times in a 24 hour period.

joe-elliott avatar Jun 28 '22 13:06 joe-elliott

I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 - 20 times in a 24 hour period.

Right, seems like a valid fix then. I suppose we would have to wait for this to land in a release, unless of course we want to build and publish images on our own.

martinjaegerdk avatar Jun 28 '22 14:06 martinjaegerdk

Every push to main builds a new image: https://hub.docker.com/r/grafana/tempo/tags

You could just update your compactor image if you'd like to leave the rest of the components alone.

joe-elliott avatar Jun 28 '22 14:06 joe-elliott

This seemed to be fixed for a while, but it has recently returned even with the jitter setting referenced above. We are still looking into this.

joe-elliott avatar Aug 11 '22 12:08 joe-elliott

We managed to fix these errors with #1632 plus setting DNS ndots=3 on the Tempo pods.

In #1632 we increased MaxIdleConns and MaxIdleConnsPerHost for the Azure backend to 100 (the default is 2).

Testing was done in a Tempo cluster running in AKS, configured with the Azure backend.

Testing with only https://github.com/grafana/tempo/pull/1632 (max connections = 100): errors reduced from 4-5 per hour to 1 error every 2-3 hours.

Testing with https://github.com/grafana/tempo/pull/1632 and the Pod DNS config set to ndots=3 (max connections = 100, ndots = 3): no errors.

Check out the Kubernetes documentation on Pod dnsConfig to learn more about the ndots option and how to set it in your pods.

electron0zero avatar Aug 22 '22 14:08 electron0zero

See also https://github.com/grafana/jsonnet-libs/pull/762; lowering ndots strictly speaking breaks Kubernetes service discovery, but if you remember not to rely on the specific bit you'll be ok.

[EDIT: was previously linking to non-public issue]

bboreham avatar Aug 23 '22 10:08 bboreham

Another way to solve this without modifying ndots is to add a . at the end of the hostname to make it fully qualified. If we went with that route, we wouldn't have to lower ndots and risk breaking Kubernetes service discovery.

But we can't do that because Azure responds with 400 Bad Request, so we have to go with setting ndots.

Here is what Azure responds with when we add . at the end.

GET https://tempo**************.blob.core.windows.net./**************?restype=container&timeout=61

Status: 400 Bad Request
Content-Length: [334]
Content-Type: [text/html; charset=us-ascii]
Date: [Thu, 11 Aug 2022 19:38:19 GMT]
Server: [Microsoft-HTTPAPI/2.0]

xml: (*azblob.storageError).UnmarshalXML did not consume entire <HTML> element

This looks like a bug; I will file a ticket with Azure.

Note: this works fine with curl because curl trims the trailing . in the URL. Try it with Chrome or other tools that don't trim the dot.

electron0zero avatar Aug 23 '22 13:08 electron0zero

@electron0zero Hi, thanks for the fix. So after the new image is built, we also need to set the Tempo pods' DNS config to ndots=3 to resolve the issue, right? We deploy Tempo through the Helm chart; not sure if there is a good way to apply this setting.

hwu-talendbj avatar Aug 30 '22 07:08 hwu-talendbj

In our case, we were seeing these errors only in compactor pods, so we added ndots=3 in compactor pods for Azure.

We did this by setting the compactor Pod's dnsConfig:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "3"

I am not sure about Helm. If you are using the chart from https://github.com/grafana/helm-charts, can you please open an issue in that repo? :)

electron0zero avatar Sep 01 '22 10:09 electron0zero

Update from Azure support: Host headers are not allowed to have a trailing dot in Azure services; you can use an FQDN, but you need to set the Host header without the trailing dot.

  • curl -v --header 'Host: *****03.blob.core.windows.net' 'https://******03.blob.core.windows.net./*******-data?restype=container&timeout=61' -> this is allowed

  • curl -v --header 'Host: *****03.blob.core.windows.net.' 'https://******03.blob.core.windows.net./*******-data?restype=container&timeout=61' -> this is NOT allowed and will result in a 400 (notice the dot in the Host header)

We need to modify the Azure client to trim the trailing dot from the Host header for FQDNs to work in Azure.

For example, curl trims the trailing dot from the Host header; see https://github.com/curl/curl/commit/3a6142865fec3c54cd7081ada86c93c135e4b32f for details.

Created https://github.com/grafana/tempo/issues/1726 to track this for Azure and other storage backends

electron0zero avatar Sep 08 '22 19:09 electron0zero