Ingester fails with "error searching block" when "replication_factor" is set to 3
Describe the bug
Our ingester logs the following errors and warnings:
level=warn ts=2023-12-15T08:49:13.134664609Z caller=server.go:1184 method=/tempopb.Querier/FindTraceByID duration=9.793361ms msg=gRPC err="completeBlock.FindTraceByID failed: error reading row from backend: missing required field: 1:FIELD<I32>"
level=error ts=2023-12-15T08:52:13.142414816Z caller=instance_search.go:144 msg="error searching block" blockID=88645ed5-56d3-4054-a392-7b6aea7a2a34 err="error opening file /bitnami/grafana-tempo/data/wal/88645ed5-56d3-4054-a392-7b6aea7a2a34+single-tenant+vParquet3/0000000029: error opening parquet file: reading magic footer of parquet file: context canceled"
We run the Ingester with 3 replicas and replication_factor: 3.
We used to run the Ingester with only 2 replicas and replication_factor: 1. If we change our current setup back to replication_factor: 1 or even 2, the error is gone.
Currently we are a bit confused by these errors. They happen quite frequently but do not seem to break anything.
To Reproduce
At least reproducible in our environment throughout all of our clusters:
- Run Ingesters with 3 replicas and replication_factor: 1
- Change replication_factor to 3
Expected behavior
Running with replication_factor: 3 works.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: bitnami helm chart
- Tempo v2.3.1
Additional Context
We needed to tune the limits quite a lot.
Configs:
Overrides:
overrides:
  "*":
    max_traces_per_user: 5000000
    ingestion_burst_size_bytes: 500000000
    ingestion_rate_limit_bytes: 500000000
    metrics_generator_processors:
      - service-graphs
      - span-metrics
Config:
usage_report:
  reporting_enabled: false
multitenancy_enabled: false
compactor:
  compaction:
    block_retention: 168h
  ring:
    kvstore:
      store: memberlist
metrics_generator:
  processor:
    service_graphs:
      max_items: 100000
      dimensions:
        - k8s.namespace.name
    span_metrics:
      dimensions:
        - k8s.namespace.name
  ring:
    kvstore:
      store: memberlist
  storage:
    path: {{ .Values.tempo.dataDir }}/wal
    remote_write: {{ include "common.tplvalues.render" (dict "value" .Values.metricsGenerator.remoteWrite "context" $) | nindent 6 }}
distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    ...snip
query_frontend:
  max_outstanding_per_tenant: 3000
  search:
    concurrent_jobs: 2000
querier:
  frontend_worker:
    frontend_address: {{ include "grafana-tempo.query-frontend.fullname" . }}-headless:{{ .Values.queryFrontend.service.ports.grpc }}
    grpc_client_config:
      max_send_msg_size: 500000000
  trace_by_id:
    query_timeout: "60s"
  search:
    query_timeout: "120s"
  max_concurrent_queries: 50
ingester:
  lifecycler:
    ring:
      replication_factor: 3
      kvstore:
        store: memberlist
    tokens_file_path: {{ .Values.tempo.dataDir }}/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - {{ include "grafana-tempo.gossip-ring.fullname" . }}
overrides:
  per_tenant_override_config: /bitnami/grafana-tempo/conf/overrides.yaml
server:
  http_server_read_timeout: "60s"
  http_server_write_timeout: "60s"
  http_listen_port: {{ .Values.tempo.containerPorts.web }}
  grpc_server_max_recv_msg_size: 500000000
  grpc_server_max_send_msg_size: 500000000
storage:
  trace:
    backend: s3
    s3:
      endpoint: ${BUCKET_HOST}
      bucket: ${BUCKET_NAME}
      access_key: ${AWS_ACCESS_KEY_ID}
      secret_key: ${AWS_SECRET_ACCESS_KEY}
      forcepathstyle: true
      insecure: true
    local:
      path: {{ .Values.tempo.dataDir }}/traces
    wal:
      path: {{ .Values.tempo.dataDir }}/wal
    cache: memcached
    memcached:
      consistent_hash: true
      host: {{ include "grafana-tempo.memcached.url" . }}
      service: memcache
      timeout: 500ms
    search:
      cache_control:
        footer: true
        column_index: true
        offset_index: true
    block:
      version: vParquet3
    pool:
      max_workers: 100
    cache_min_compaction_level: 1
    cache_max_block_age: 48h
Good catch. This is an innocuous error. In RF3, if the querier gets 2 responses it can be assured that it has received all spans for a trace (due to read/write quorum), and so it cancels the final request.
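For illustration, here is a minimal standalone Go sketch of that read-quorum fan-out pattern. It is not Tempo's actual querier code; the names and types are made up. With replication_factor 3 and a read quorum of 2, the first two replica responses are enough, so the remaining in-flight request is cancelled, and that replica is the one that observes context.Canceled:

// quorum_cancel.go: hypothetical sketch of the read-quorum pattern described
// above, NOT Tempo's querier code. Once 2 of 3 replicas have answered, the
// third request is cancelled and sees context.Canceled, which the ingester
// then logs as "... context canceled".
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// queryReplica stands in for a FindTraceByID call against one ingester replica.
func queryReplica(ctx context.Context, replica int) (string, error) {
	select {
	case <-time.After(time.Duration(rand.Intn(100)) * time.Millisecond):
		return fmt.Sprintf("spans-from-replica-%d", replica), nil
	case <-ctx.Done():
		// The replica that loses the race returns exactly this error.
		return "", ctx.Err()
	}
}

func main() {
	const replicas, quorum = 3, 2

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	type result struct {
		data string
		err  error
	}
	results := make(chan result, replicas)

	// Fan out to all replicas in parallel.
	for r := 0; r < replicas; r++ {
		go func(r int) {
			data, err := queryReplica(ctx, r)
			results <- result{data, err}
		}(r)
	}

	// Collect responses; once the read quorum is met, cancel the straggler.
	ok := 0
	for i := 0; i < replicas; i++ {
		res := <-results
		if errors.Is(res.err, context.Canceled) {
			fmt.Println("replica cancelled after quorum was reached (benign)")
			continue
		}
		if res.err == nil {
			ok++
			fmt.Println("got", res.data)
			if ok == quorum {
				cancel() // all spans are guaranteed present; stop the rest
			}
		}
	}
}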
We should log context cancelled on query at the debug level in the ingester.
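As a hypothetical sketch of that change (again, not Tempo's actual logging code; the helper and logger setup are invented, using the standard library slog for brevity), the idea is to classify context.Canceled separately from real block-read failures:

// log_level.go: illustrative sketch of the proposed fix. A cancelled
// search/FindTraceByID request is an expected outcome under RF=3 and is
// logged at debug; everything else stays at error level.
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"os"
)

// logSearchError picks the log level based on whether the error is just the
// caller giving up (benign) or a genuine block-read failure.
func logSearchError(logger *slog.Logger, blockID string, err error) {
	if errors.Is(err, context.Canceled) {
		logger.Debug("block search cancelled by caller", "blockID", blockID, "err", err)
		return
	}
	logger.Error("error searching block", "blockID", blockID, "err", err)
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))

	// A cancelled query only produces a debug line...
	logSearchError(logger, "88645ed5-56d3-4054-a392-7b6aea7a2a34",
		fmt.Errorf("reading magic footer of parquet file: %w", context.Canceled))

	// ...while a genuine failure still logs at error level.
	logSearchError(logger, "88645ed5-56d3-4054-a392-7b6aea7a2a34",
		errors.New("missing required field: 1:FIELD<I32>"))
}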
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.