Azure DNS Lookup Failures
Describe the bug
In Azure, all components can register DNS errors while trying to work with meta.compacted.json. This issue does not occur in any other environment. Unsure if this is an Azure issue or a Tempo issue. The error itself is a failed TCP connection to a DNS server, which suggests this is some issue with Azure infrastructure. However, the fact that the error (almost?) always occurs on meta.compacted.json suggests that something about the way we handle that file is different and is causing this issue.
The failures look like:
reading storage container: Head "https://tempoe**************.blob.core.windows.net/tempo/single-tenant/d8aafc48-5796-4221-ac0b-58e001d18515/meta.compacted.json?timeout=61": dial tcp: lookup tempoe**************.blob.core.windows.net on 10.0.0.10:53: dial udp 10.0.0.10:53: operation was canceled
or
error deleting blob, name: single-tenant/*******************/data: Delete "https://tempoe******.blob.core.windows.net/tempo/single-tenant/5b1ab746-fee7-409c-944d-1c1d5ba7a70e/data?timeout=61": dial tcp: lookup tempoe******.blob.core.windows.net on 10.0.0.10:53: dial udp 10.0.0.10:53: operation was canceled
We have seen this issue internally. Also reported here: #1372.
Not sure if it is related, but we also see TempoTenantIndexFailures and TempoTenantIndexTooOld errors; they may be caused by this.
Yes, both of those can be triggered by this issue. In Azure we would also recommend setting the following value:
storage:
  trace:
    blocklist_poll_tenant_index_builders: 1
This will reduce errors related to tenant index building. Azure does not like it when multiple processes attempt to write to the same object.
@joe-elliott thanks, I am trying it
@joe-elliott Hi, I just noticed that we are using only one Tempo compactor, so blocklist_poll_tenant_index_builders: 1 may not help in this case. Or do you think it is better to go with multiple compactors?
Compactors simply reduce the length of the blocklist and help keep query latencies down. If your latencies are fine then your compactors are fine. Two metrics that can help you keep an eye on your blocklist:
avg(tempodb_blocklist_length{}) by (tenant)
and
sum(tempodb_compaction_outstanding_blocks{}) by (tenant)
Inside the compactor pod, I tested nslookup against the Azure blob host, and it looks good:
nslookup tempoe4c***********2b.blob.core.windows.net
Server: 10.0.0.10
Address: 10.0.0.10:53
Non-authoritative answer:
tempoe4c1**********b.blob.core.windows.net canonical name = blob.ams07prdstr02a.store.core.windows.net
Name: blob.ams07prdstr02a.store.core.windows.net
Address: 20.150.37.228
Non-authoritative answer:
tempoe4c**********b.blob.core.windows.net canonical name = blob.ams07prdstr02a.store.core.windows.net
How can I enable debug mode for Tempo so I can get more logs?
You can use the following configuration to enable debug logging on Tempo
server:
  log_level: debug
Hello :) After using debug mode with @hwu-talendbj, we found more data. This is what happens during push/pull:
{"caller":"memberlist_logger.go:74","level":"error","msg":"Push/Pull with grafana-tempo-distributor-66fbfb5d99-ms794-ad6112e9 failed: dial tcp 10.128.18.112:7946: i/o timeout","ts":"2022-06-21T12:46:32.111904425Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:32.452547619Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.20.104:42880","ts":"2022-06-21T12:46:33.866113092Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.16.18:51284","ts":"2022-06-21T12:46:36.210169812Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.38042ms","traceID":"5ce5bffbf7a76751","ts":"2022-06-21T12:46:38.496729396Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.50652ms","traceID":"4f0e40a70e0652d9","ts":"2022-06-21T12:46:38.510874078Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:39.451709386Z"}
{"caller":"memberlist_logger.go:74","level":"info","msg":"Suspect grafana-tempo-distributor-66fbfb5d99-ms794-ad6112e9 has failed, no acks received","ts":"2022-06-21T12:46:39.454869705Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.17.62:55936","ts":"2022-06-21T12:46:39.986359017Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:40.182091363Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: grafana-tempo-distributor-66fbfb5d99-j67d2-a0aa99f0 (timeout reached)","ts":"2022-06-21T12:46:41.455734719Z"}
{"caller":"compactor.go:64","level":"info","msg":"starting compaction cycle","offset":0,"tenantID":"single-tenant","ts":"2022-06-21T12:46:51.476815384Z"}
{"caller":"compactor.go:70","level":"info","msg":"compaction cycle complete. No more blocks to compact","tenantID":"single-tenant","ts":"2022-06-21T12:46:51.476889284Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:52.451737191Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: grafana-tempo-distributor-66fbfb5d99-j67d2-a0aa99f0 (timeout reached)","ts":"2022-06-21T12:46:56.453139815Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.18.158:46552","ts":"2022-06-21T12:46:57.18343619Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:57.396034335Z"}
{"caller":"broadcast.go:48","content":"[grafana-tempo-ingester-0]","key":"collectors/ring","level":"debug","msg":"Invalidating forwarded broadcast","oldContent":"[grafana-tempo-ingester-0]","oldVersion":6660,"ts":"2022-06-21T12:46:57.397407443Z","version":6661}
But it should look like:
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Initiating push/pull sync with: grafana-tempo-ingester-0-435a9d15 10.128.18.158:7946","ts":"2022-06-21T12:48:02.11782556Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.511421ms","traceID":"4aa6e02d1b78e775","ts":"2022-06-21T12:48:08.497212902Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.194218ms","traceID":"6ab9c753aed0029e","ts":"2022-06-21T12:48:08.510489979Z"}
{"caller":"retention.go:46","level":"debug","msg":"Performing block retention","retention":"48h0m0s","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476575596Z"}
{"caller":"compactor.go:192","hash":"1b8c524d-a934-451b-a9ca-4ea7db8e3b7e","level":"debug","msg":"checking hash","ts":"2022-06-21T12:48:21.476661197Z"}
{"caller":"compactor.go:211","level":"debug","msg":"checking addresses","owning_addr":"10.128.17.60:0","this_addr":"10.128.17.60:0","ts":"2022-06-21T12:48:21.476687197Z"}
{"blockID":"1b8c524d-a934-451b-a9ca-4ea7db8e3b7e","caller":"retention.go:76","level":"info","msg":"deleting block","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476701097Z"}
{"caller":"compactor.go:64","level":"info","msg":"starting compaction cycle","offset":0,"tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476772798Z"}
{"caller":"compactor.go:70","level":"info","msg":"compaction cycle complete. No more blocks to compact","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476833798Z"}
Are you sure this is related to Azure DNS? :(
@joe-elliott Also experiencing the same error when running in Azure (AKS and Azure Blob Storage). I tried running with the below config for two days, but am still getting the same error.
storage:
  trace:
    blocklist_poll_tenant_index_builders: 1
We are using a private endpoint in Azure to connect to the storage account. Is anyone else using a private endpoint to connect to the storage account?
@galvesLoy
Are you sure this is related to Azure DNS? :(
If you're referring to the memberlist issues in your logs, failures during rollout are normal and should be ignored. Constant memberlist failures suggest networking issues in your cluster. (https://github.com/grafana/tempo/issues/927)
We run multiple clusters on AWS, GCP, and Azure. Azure is the only cloud provider on which we see this issue, and we see it on all Azure clusters. It almost always occurs when performing a HEAD request on a meta.compacted.json, but rarely it also occurs on a GET request to meta.json. It is likely related to access patterns in Tempo b/c AFAIK other teams internally at Grafana are not seeing this issue.
We have only seen this issue occurring in compactors and only when polling the blocklist. Perhaps something about the rapid fire requests to Azure blob storage during polling is causing an issue?
@martin-jaeger-maersk
That suggested configuration prevents a different error unique to ABS. ABS sometimes throws a strange error when two processes attempt to write the same object at the same time. We were receiving the error here until we reduced blocklist_poll_tenant_index_builders to 1. It is 2 by default.
Following up with the thought that perhaps our default settings are hitting ABS too hard, I'm going to experiment with lowering storage.trace.blocklist_poll_concurrency to see if the issue goes away.
In a cluster with ~250 blocks I changed storage.trace.blocklist_poll_concurrency from the default of 50 to 10, but we continued to see the issue. Then I lowered storage.trace.blocklist_poll_concurrency to 1 and the issue went away. This technically "fixes" the issue but is likely not useful for larger clusters, where the time to complete a polling cycle would go up significantly. I do find it quite odd that the client appears to be making a DNS request every time we access an object, but maybe this is an intentional choice on Azure's part.
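To make the link between poll concurrency and DNS pressure concrete, here is a minimal Go sketch (a made-up pollBlocks helper, not Tempo's actual poller): each in-flight request that can't reuse a connection needs its own lookup, so the concurrency setting roughly bounds how many DNS queries fire at once.

package main

import (
	"fmt"
	"net/http"
	"sync"
)

// pollBlocks is a hypothetical stand-in for the poller: each worker issues its
// own request, and with a cold connection pool each request can need a fresh
// DNS lookup, so concurrency maps almost directly onto simultaneous lookups.
func pollBlocks(blockURLs []string, concurrency int) {
	sem := make(chan struct{}, concurrency) // plays the role of blocklist_poll_concurrency
	var wg sync.WaitGroup
	for _, u := range blockURLs {
		wg.Add(1)
		sem <- struct{}{}
		go func(url string) {
			defer wg.Done()
			defer func() { <-sem }()
			resp, err := http.Head(url) // e.g. a HEAD on meta.compacted.json
			if err != nil {
				fmt.Println("poll error:", err)
				return
			}
			resp.Body.Close()
		}(u)
	}
	wg.Wait()
}

func main() {
	// With concurrency 1 (the value that made the errors disappear), requests,
	// and therefore lookups, are strictly serialized.
	pollBlocks([]string{}, 1)
}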
I'm going to spend some time trying to upgrade our client. There appear to be 2 official clients and it's unclear what the relationship is between them.
I do find it quite odd that the client appears to be making a DNS request every time we access an object, but maybe this is an intentional choice on Azure's part.
I agree, that does sound very odd. It could be the Azure Go client doing it intentionally, or simply a very low TTL on the DNS entry.
Where in the Tempo code is the Azure client called (in this particular case)?
Here's where we begin polling the blocklist concurrently:
https://github.com/grafana/tempo/blob/d8d4dc86ac30f87b88718d452003ec7061a88a52/tempodb/blocklist/poller.go#L186
The actual poll function requests the block meta and, if that can't be found, the compacted block meta:
https://github.com/grafana/tempo/blob/d8d4dc86ac30f87b88718d452003ec7061a88a52/tempodb/blocklist/poller.go#L234
Both BlockMeta() and CompactedBlockMeta() eventually hit this method on the Azure backend:
https://github.com/grafana/tempo/blob/d8d4dc86ac30f87b88718d452003ec7061a88a52/tempodb/backend/azure/azure.go#L144
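To make that flow easier to follow without clicking through, here is a rough Go sketch of the fallback, using hypothetical stand-ins (Reader, ErrDoesNotExist, pollBlock) rather than Tempo's real types; the actual implementation is in the poller.go linked above.

package sketch

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real backend types.
var ErrDoesNotExist = errors.New("does not exist")

type Reader interface {
	BlockMeta(blockID, tenantID string) ([]byte, error)          // reads meta.json
	CompactedBlockMeta(blockID, tenantID string) ([]byte, error) // reads meta.compacted.json
}

// pollBlock mirrors the flow described above: try the live meta first and fall
// back to the compacted meta, which is why meta.compacted.json gets touched for
// every block that has already been compacted.
func pollBlock(r Reader, blockID, tenantID string) ([]byte, error) {
	meta, err := r.BlockMeta(blockID, tenantID)
	if err == nil {
		return meta, nil
	}
	if !errors.Is(err, ErrDoesNotExist) {
		return nil, fmt.Errorf("reading meta.json: %w", err)
	}
	// meta.json is gone, so the block was compacted; request meta.compacted.json instead.
	return r.CompactedBlockMeta(blockID, tenantID)
}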
@joe-elliott A few more points that could make sense, since this is only seen in Azure:
DNS request throttle
https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling
Azure DNS and Azure Private DNS have a throttle limit of 500 read (GET) operations per 5 minutes.
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-name-resolution-for-vms-and-role-instances#considerations
DNS query traffic is throttled for each VM. Throttling shouldn't impact most applications. If request throttling is observed, ensure that client-side caching is enabled. For more information, see DNS client configuration.
DNS race condition when using UDP
Not Azure-specific per se, but it can be a result of the OS and patch level of the VMs used in AKS.
https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69
Found this article, which talks about all the things that can go wrong when hammering DNS with UDP. One suggestion he brings up is to force the container to use TCP instead of UDP. This adds more overhead, but could result in connections actually being established. The thesis is that high volumes of DNS requests over UDP can cause a race condition in the kernel, which can result in UDP requests failing.
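If someone wants to experiment with that suggestion at the client level rather than via resolv.conf (the Kubernetes-level equivalent would be something like the resolv.conf use-vc option set through pod dnsConfig, where the image's libc supports it), here is a hedged Go sketch, not Tempo code, of forcing the Go resolver to talk to the DNS server over TCP:

package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Resolver that always dials the DNS server over TCP, sidestepping the
	// UDP race described in the article above.
	resolver := &net.Resolver{
		PreferGo: true, // use the pure-Go resolver so the Dial hook is honored
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, "tcp", address)
		},
	}

	dialer := &net.Dialer{Resolver: resolver, Timeout: 10 * time.Second}
	client := &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
	}

	// Hypothetical storage account; the real hosts are redacted above.
	resp, err := client.Head("https://tempoexample.blob.core.windows.net/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("status:", resp.Status)
}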
Using the newly added config in the linked PR and setting the blocklist_poll_jitter_ms value to 500, I was able to go 24 hours without seeing this issue.
#1518
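For anyone curious what the jitter buys us, here is a tiny Go sketch, with a hypothetical pollWithJitter helper rather than the real code from #1518: each request is delayed by a random 0..jitterMs so the polling cycle no longer fires as one tight burst of lookups.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// pollWithJitter is a hypothetical helper: before each request it sleeps a
// random duration up to jitterMs, spreading the lookups out over time.
func pollWithJitter(blockIDs []string, jitterMs int) {
	for _, id := range blockIDs {
		if jitterMs > 0 {
			time.Sleep(time.Duration(rand.Intn(jitterMs)) * time.Millisecond)
		}
		fmt.Println("HEAD meta.compacted.json for block", id) // placeholder for the real request
	}
}

func main() {
	pollWithJitter([]string{"block-a", "block-b", "block-c"}, 500)
}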
@joe-elliott That's very interesting 👍 Does that mean that after 24 hours you saw the error, or simply that you have tested for 24 hours without seeing the error?
I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 - 20 times in a 24 hour period.
I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 - 20 times in a 24 hour period.
Right, seems like a valid fix then. I suppose we would have to wait for this to land in a release, unless of course we want to build and publish images on our own.
Every push to main builds a new image: https://hub.docker.com/r/grafana/tempo/tags
You could just update your compactor image if you'd like to leave the rest of the components alone.
This seemed to be fixed for a while, but has recently returned even with the jitter setting referenced above. We are still looking into this.
We managed to fix these errors with #1632 and setting the Tempo pods' DNS ndots=3.
In #1632, we increased MaxIdleConns and MaxIdleConnsPerHost for the Azure backend to 100 (the default is 2).
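For context, the change is along these lines; this is only a sketch of raising the idle-connection limits on a Go http.Transport, not the exact code from #1632, which wires this into the Azure client's pipeline:

package main

import (
	"net/http"
	"time"
)

// newStorageHTTPClient clones Go's default transport and raises the idle
// connection limits so connections to the single storage-account host get
// reused instead of being re-dialed (and re-resolved) for nearly every request.
func newStorageHTTPClient() *http.Client {
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.MaxIdleConns = 100
	transport.MaxIdleConnsPerHost = 100 // Go's per-host default is only 2
	return &http.Client{Transport: transport, Timeout: 2 * time.Minute}
}

func main() {
	client := newStorageHTTPClient()
	_ = client // would be handed to the blob client / pipeline
}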
Testing was done in a Tempo cluster running in AKS and configured with the Azure backend.
Testing with only https://github.com/grafana/tempo/pull/1632: errors reduced from 4-5 errors per hour to 1 error every 2-3 hours.
Testing with https://github.com/grafana/tempo/pull/1632 and the pod DNS config set to ndots=3: no errors.
Check out the following links to learn more about the ndots config and how to set it in your pods:
See also https://github.com/grafana/jsonnet-libs/pull/762; lowering ndots strictly speaking breaks Kubernetes service discovery, but if you remember not to rely on that specific bit you'll be OK.
[EDIT: was previously linking to non-public issue]
Another way to solve this without modifying ndots is to add a . at the end of the hostname to make it fully qualified. If we go with that route, we don't have to lower ndots and risk breaking Kubernetes service discovery. But we can't do that, because Azure responds with 400 Bad Request, so we have to go with setting ndots.
Here is what Azure responds with when we add a . at the end:
GET https://tempo**************.blob.core.windows.net./**************?restype=container&timeout=61
Status: 400 Bad Request
Content-Length: [334]
Content-Type: [text/html; charset=us-ascii]
Date: [Thu, 11 Aug 2022 19:38:19 GMT]
Server: [Microsoft-HTTPAPI/2.0]
xml: (*azblob.storageError).UnmarshalXML did not consume entire <HTML> element
This looks like a bug, I will file a ticket with Azure.
Note: this works fine with curl because curl trims the . in the URL; try it with Chrome or other tools that don't trim the trailing dot.
@electron0zero Hi, thanks for the fix.
So after the new image is built, we also need to set the Tempo pods' DNS config to ndots=3 to resolve the issue, right?
We deploy Tempo through the Helm chart; not sure if there is a good way to apply this setting.
In our case, we were seeing these errors only in compactor pods, so we added ndots=3 in the compactor pods for Azure. We did this by setting the compactor pod's DNS config:
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "3"
I am not sure about Helm. If you are using the chart from https://github.com/grafana/helm-charts, can you please open an issue in that repo? :)
Update from Azure support: Host headers are not allowed to have a trailing dot in Azure services; you can use an FQDN, but you need to set the Host header without the trailing dot.
- curl -v --header 'Host: *****03.blob.core.windows.net' https://******03.blob.core.windows.net./*******-data?restype=container&timeout=61 -> this is allowed
- curl -v --header 'Host: *****03.blob.core.windows.net.' https://******03.blob.core.windows.net./*******-data?restype=container&timeout=61 -> this is NOT allowed, and will result in 400 (notice the dot in the Host header)
We need to modify the Azure client to trim the trailing dot from the Host header for FQDNs to work in Azure. For example, curl trims the trailing dot from the Host header; see https://github.com/curl/curl/commit/3a6142865fec3c54cd7081ada86c93c135e4b32f for details.
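In Go, that workaround could look something like the sketch below (hypothetical account name, not the actual client change), since Request.Host overrides the Host header that would otherwise be derived from the URL:

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Keep the trailing dot in the URL so the resolver treats the name as
	// fully qualified and skips the ndots search suffixes.
	url := "https://tempoexample.blob.core.windows.net./tempo?restype=container&timeout=61"

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// Trim the trailing dot for the Host header only, like curl does,
	// so Azure doesn't reject the request with a 400.
	req.Host = strings.TrimSuffix(req.URL.Hostname(), ".")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}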
Created https://github.com/grafana/tempo/issues/1726 to track this for Azure and other storage backends