Our Pinot cluster is not highly available
Hey all!
We are facing a serious high-availability issue despite having replication across all components. When one of our three broker/server pods is unavailable, some of our queries fail with the error

```
Unable to resolve host pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local
```

or

```
Unable to resolve host pinot-broker-2.pinot-server-headless.pinot-dev.svc.cluster.local
```

So there is downtime whenever a server or broker pod restarts.
Cluster and Table Setup
Some context about our cluster configuration:
- Our cluster is deployed on AWS EKS using the official Pinot Helm chart
- Machine type: c6a.8xlarge (32 cores, 64 GB)
- Machine count: 6
- Pinot components: 3 servers, 3 controllers, 3 brokers, 3 minions, 5 ZooKeeper
- Resources: 3 * server (4 cores, 14 GB, 1000 GB), 3 * controller (4 cores, 14 GB, 100 GB), 3 * broker (4 cores, 14 GB, 100 GB), 5 * ZooKeeper (4 cores, 14 GB, 100 GB)
- Pinot version: 1.0.0. ZooKeeper version: 3.8.0-5a02a05eddb59aee6ac762f7ea82e92a68eb9c0f, built on 2022-02-25 08:49 UTC (deployed with the Pinot Helm chart)
Here is our values.yaml:

```yaml
image:
  repository: ${CI_REGISTRY_IMAGE}
  tag: "${PINOT_IMAGE_TAG}"
  pullPolicy: Always

imagePullSecrets:
  - name: ${CI_PROJECT_NAME}

cluster:
  name: "${PINOT_CLUSTER_NAME}"

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "${PINOT_IRSA_ROLE_ARN}"
  name: "pinot"

probes:
  initialDelaySeconds: 300
  periodSeconds: 30

pinotAuth:
  enabled: true
  controllerFactoryClass: org.apache.pinot.controller.api.access.ZkBasicAuthAccessControlFactory
  brokerFactoryClass: org.apache.pinot.broker.broker.ZkBasicAuthAccessControlFactory
  configs:
    - access.control.principals=admin
    - access.control.principals.admin.password=${PINOT_ADMIN_PASSWORD}
    - access.control.init.username=admin
    - access.control.init.password=${PINOT_ADMIN_PASSWORD}

# ------------------------------------------------------------------------------
# Pinot Controller:
# ------------------------------------------------------------------------------
controller:
  replicaCount: 3
  probes:
    livenessEnabled: true
    readinessEnabled: true
  persistence:
    size: ${PINOT_CONTROLLER_VOL_SIZE}
    storageClass: ${PINOT_STORAGE_CLASS}
  data:
    dir: "${PINOT_SEGMENT_DIR}"
  podSecurityContext:
    fsGroupChangePolicy: Always
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 3000
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
  jvmOpts: "-XX:+ExitOnOutOfMemoryError -Xms1G -Xmx14G -Djute.maxbuffer=100000000 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml"
  service:
    annotations:
      "prometheus.io/scrape": "true"
      "prometheus.io/port": "8008"
  external:
    enabled: false
  resources:
    requests:
      cpu: 4
      memory: "14Gi"
  nodeSelector:
    workload-type: ${PINOT_WORKLOAD_TYPE}
  podAnnotations:
    "prometheus.io/scrape": "true"
    "prometheus.io/port": "8008"
  extraEnv:
    - name: LOG4J_CONSOLE_LEVEL
      value: error
  # Extra configs will be appended to pinot-controller.conf file
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      controller.task.scheduler.enabled=true
      controller.task.frequencyPeriod=1h
      access.control.init.username=admin
      access.control.init.password=${PINOT_ADMIN_PASSWORD}
      controller.local.temp.dir=/tmp/pinot-tmp-data/
      controller.allow.hlc.tables=false
      controller.enable.split.commit=true
      controller.realtime.segment.deepStoreUploadRetryEnabled=true
      controller.segment.fetcher.auth.token=${PINOT_AUTH_TOKEN}
      pinot.controller.storage.factory.s3.disableAcl=false
      pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.controller.storage.factory.s3.region=${AWS_S3_REGION}
      pinot.controller.storage.factory.s3.httpclient.maxConnections=100
      pinot.controller.storage.factory.s3.httpclient.socketTimeout=30s
      pinot.controller.storage.factory.s3.httpclient.connectionTimeout=2s
      pinot.controller.storage.factory.s3.httpclient.connectionTimeToLive=0s
      pinot.controller.storage.factory.s3.httpclient.connectionAcquisitionTimeout=10s
      pinot.controller.segment.fetcher.protocols=file,http,s3
      pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
      pinot.multistage.engine.enabled=true
      pinot.server.instance.currentDataTableVersion=4
      pinot.query.server.port=8421
      pinot.query.runner.port=8442
      pinot.query.scheduler.accounting.factory.name=org.apache.pinot.core.accounting.PerQueryCPUMemAccountantFactory
      pinot.query.scheduler.accounting.enable.thread.memory.sampling=true
      pinot.query.scheduler.accounting.enable.thread.cpu.sampling=true
      pinot.query.scheduler.accounting.oom.enable.killing.query=true
      pinot.query.scheduler.accounting.publishing.jvm.heap.usage=true

# ------------------------------------------------------------------------------
# Pinot Broker:
# ------------------------------------------------------------------------------
broker:
  replicaCount: 3
  jvmOpts: "-XX:+ExitOnOutOfMemoryError -Xms1G -Xmx14G -Djute.maxbuffer=100000000 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml"
  podSecurityContext:
    fsGroupChangePolicy: Always
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 3000
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
  service:
    annotations:
      "prometheus.io/scrape": "true"
      "prometheus.io/port": "8008"
  external:
    enabled: false
  ingress:
    v1:
      enabled: true
      ingressClassName: ""
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=60
        alb.ingress.kubernetes.io/certificate-arn: "${PINOT_BROKER_ALB_ACM_CERTIFICATE_ARN}"
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
        alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=false
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/security-groups: ${PINOT_BROKER_ALB_SECURITY_GROUP}
        alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-Ext-2018-06
        alb.ingress.kubernetes.io/tags: ${PINOT_BROKER_ALB_TAGS}
      tls: []
      path: /
      hosts:
        - ${PINOT_BROKER_ALB_HOST}
  resources:
    requests:
      cpu: 4
      memory: "14Gi"
  nodeSelector:
    workload-type: ${PINOT_WORKLOAD_TYPE}
  podAnnotations:
    "prometheus.io/scrape": "true"
    "prometheus.io/port": "8008"
  extraEnv:
    - name: LOG4J_CONSOLE_LEVEL
      value: debug
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      pinot.multistage.engine.enabled=true
      pinot.server.instance.currentDataTableVersion=4
      pinot.query.server.port=8421
      pinot.query.runner.port=8442
      pinot.broker.enable.query.cancellation=true
      pinot.query.scheduler.accounting.factory.name=org.apache.pinot.core.accounting.PerQueryCPUMemAccountantFactory
      pinot.query.scheduler.accounting.enable.thread.memory.sampling=true
      pinot.query.scheduler.accounting.enable.thread.cpu.sampling=true
      pinot.query.scheduler.accounting.oom.enable.killing.query=true
      pinot.query.scheduler.accounting.publishing.jvm.heap.usage=true

# ------------------------------------------------------------------------------
# Pinot Server:
# ------------------------------------------------------------------------------
server:
  replicaCount: 3
  probes:
    livenessEnabled: true
    readinessEnabled: true
  podSecurityContext:
    fsGroupChangePolicy: Always
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 3000
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
  persistence:
    size: ${PINOT_SERVER_VOL_SIZE}
    storageClass: ${PINOT_STORAGE_CLASS}
  jvmOpts: "-XX:+ExitOnOutOfMemoryError -Xms1G -Xmx6G -Djute.maxbuffer=100000000 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml"
  resources:
    requests:
      cpu: 4
      memory: "14Gi"
  nodeSelector:
    workload-type: ${PINOT_WORKLOAD_TYPE}
  podAnnotations:
    "prometheus.io/scrape": "true"
    "prometheus.io/port": "8008"
  extraEnv:
    - name: LOG4J_CONSOLE_LEVEL
      value: error
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      pinot.server.instance.realtime.alloc.offheap=true
      pinot.server.instance.enable.split.commit=true
      realtime.segment.serverUploadToDeepStore=true
      pinot.server.instance.segment.store.uri=${PINOT_SEGMENT_DIR}
      pinot.server.storage.factory.s3.disableAcl=false
      pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.server.storage.factory.s3.region=${AWS_S3_REGION}
      pinot.server.segment.fetcher.protocols=file,http,s3
      pinot.server.storage.factory.s3.httpclient.maxConnections=1000
      pinot.server.storage.factory.s3.httpclient.socketTimeout=30s
      pinot.server.storage.factory.s3.httpclient.connectionTimeout=2s
      pinot.server.storage.factory.s3.httpclient.connectionTimeToLive=0s
      pinot.server.storage.factory.s3.httpclient.connectionAcquisitionTimeout=10s
      pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
      pinot.server.segment.fetcher.auth.token=${PINOT_AUTH_TOKEN}
      pinot.server.segment.uploader.auth.token=${PINOT_AUTH_TOKEN}
      pinot.server.instance.auth.token=${PINOT_AUTH_TOKEN}
      pinot.multistage.engine.enabled=true
      pinot.server.instance.currentDataTableVersion=4
      pinot.query.server.port=8421
      pinot.query.runner.port=8442
      pinot.server.enable.query.cancellation=true
      pinot.query.scheduler.accounting.factory.name=org.apache.pinot.core.accounting.PerQueryCPUMemAccountantFactory
      pinot.query.scheduler.accounting.enable.thread.memory.sampling=true
      pinot.query.scheduler.accounting.enable.thread.cpu.sampling=true
      pinot.query.scheduler.accounting.oom.enable.killing.query=true
      pinot.query.scheduler.accounting.publishing.jvm.heap.usage=true

# ------------------------------------------------------------------------------
# Pinot Minion:
# ------------------------------------------------------------------------------
minionStateless:
  enabled: false

minion:
  enabled: true
  replicaCount: 3
  dataDir: "${PINOT_MINION_DATA_DIR}"
  jvmOpts: "-XX:+ExitOnOutOfMemoryError -Xms1G -Xmx8G -Djute.maxbuffer=100000000 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log"
  podSecurityContext:
    fsGroupChangePolicy: Always
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 3000
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
  persistence:
    enabled: true
    accessMode: ReadWriteOnce
    size: ${PINOT_MINION_VOL_SIZE}
    storageClass: ${PINOT_STORAGE_CLASS}
  resources:
    requests:
      cpu: 4
      memory: "14Gi"
  nodeSelector:
    workload-type: ${PINOT_WORKLOAD_TYPE}
  podAnnotations:
    "prometheus.io/scrape": "true"
    "prometheus.io/port": "8008"
  extraEnv:
    - name: LOG4J_CONSOLE_LEVEL
      value: error
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.minion.storage.factory.s3.region=${AWS_S3_REGION}
      pinot.minion.segment.fetcher.protocols=file,http,s3
      pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
      pinot.minion.storage.factory.s3.httpclient.maxConnections=1000
      pinot.minion.storage.factory.s3.httpclient.socketTimeout=30s
      pinot.minion.storage.factory.s3.httpclient.connectionTimeout=2s
      pinot.minion.storage.factory.s3.httpclient.connectionTimeToLive=0s
      pinot.minion.storage.factory.s3.httpclient.connectionAcquisitionTimeout=10s
      segment.fetcher.auth.token=${PINOT_AUTH_TOKEN}
      task.auth.token=${PINOT_AUTH_TOKEN}
      pinot.multistage.engine.enabled=true
      pinot.server.instance.currentDataTableVersion=4
      pinot.query.server.port=8421
      pinot.query.runner.port=8442
      pinot.query.scheduler.accounting.factory.name=org.apache.pinot.core.accounting.PerQueryCPUMemAccountantFactory
      pinot.query.scheduler.accounting.enable.thread.memory.sampling=true
      pinot.query.scheduler.accounting.enable.thread.cpu.sampling=true
      pinot.query.scheduler.accounting.oom.enable.killing.query=true
      pinot.query.scheduler.accounting.publishing.jvm.heap.usage=true

zookeeper:
  enabled: true
  urlOverride: "my-zookeeper:2181/my-pinot"
  port: 2181
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1001
  containerSecurityContext:
    runAsNonRoot: true
    runAsUser: 1000
  env:
    # https://github.com/mrbobbytables/zookeeper/blob/master/README.md
    ZOO_HEAP_SIZE: "10G"
    ZOOKEEPER_LOG_STDOUT_THRESHOLD: "ERROR"
    JAVA_OPTS: "-XX:+ExitOnOutOfMemoryError -Xms4G -Xmx10G -Djute.maxbuffer=100000000 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/bitnami/zookeeper/logs/gc-pinot-zookeeper.log"
  resources:
    requests:
      cpu: 4
      memory: "14Gi"
  replicaCount: 3
  persistence:
    enabled: true
    size: ${PINOT_ZOOKEEPER_VOL_SIZE}
    storageClass: ${PINOT_STORAGE_CLASS}
  image:
    PullPolicy: "IfNotPresent"
  nodeSelector:
    workload-type: ${PINOT_WORKLOAD_TYPE}

# References
# https://docs.pinot.apache.org/operators/operating-pinot/oom-protection-using-automatic-query-killing
# https://docs.pinot.apache.org/operators/tutorials/deployment-pinot-on-kubernetes
# https://startree.ai/blog/capacity-planning-in-apache-pinot-part-1
# https://startree.ai/blog/capacity-planning-in-apache-pinot-part-2
```
We query Pinot in three ways:
- Via Superset, with the URL `pinot://<user>:<password>@<host>:<port>/query/sql?controller=http://<host>:<port>/&verify_ssl=true`
- Via the Pinot broker ingress load balancer (deployed with the Pinot Helm chart)
- Via the Pinot admin UI, accessible by port-forwarding the pinot-controller Service object
Our standard offline tables have the following config:

```json
{
  "tableName": "pinot_metadata_feeds",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "365",
    "schemaName": "pinot_metadata_feeds",
    "replication": "3",
    "replicasPerPartition": "3",
    "segmentPushType": "APPEND",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "ingestionConfig": {},
  "task": {
    "taskTypeConfigsMap": {
      "MergeRollupTask": {
        "1day.mergeType": "concat",
        "1day.bucketTimePeriod": "1d",
        "1day.bufferTimePeriod": "1d"
      }
    }
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "nullHandlingEnabled": "true"
  },
  "metadata": {
    "customConfigs": {}
  }
}
```
Problem
When a broker or server pod gets restarted during a cluster update, or when our ops team makes changes to the Kubernetes cluster, some of our queries fail.
With multistage disabled: queries appear to be routed in round-robin fashion. If we retry the same query 5 times, it fails 1-2 times, when it reaches the restarting server or broker pod, and succeeds the other 3-4 times, when it reaches the healthy pods.
With multistage enabled: queries almost always fail while one of the broker or server pods is restarting. It seems the queries fan out to all servers.
Disabling multistage is not an option for us, since we use joins in some queries.
The error log we get when, for example, server-2 is restarting:

```
Error dispatching query to server=pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local@{8421,8442} stage=1
org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:144)
org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:93)
org.apache.pinot.broker.requesthandler.MultiStageBrokerRequestHandler.handleRequest(MultiStageBrokerRequestHandler.java:179)
org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.handleRequest(BaseBrokerRequestHandler.java:263)
UNAVAILABLE: Unable to resolve host pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local
io.grpc.Status.asRuntimeException(Status.java:539)
io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
java.net.UnknownHostException: pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local: Name or service not known
io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local: Name or service not known
java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:930)
java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1543)
java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
```
Full query response in the Pinot UI for a query that succeeded with multistage enabled (here you can see the query being routed to multiple servers), `SELECT COUNT(*) FROM pinot_metadata_feeds;`:

```json
{
  "resultTable": {
    "dataSchema": {
      "columnNames": [
        "EXPR$0"
      ],
      "columnDataTypes": [
        "LONG"
      ]
    },
    "rows": [
      [
        1206298
      ]
    ]
  },
  "requestId": "132994053000002907",
  "stageStats": {
    "1": {
      "numBlocks": 6,
      "numRows": 3,
      "stageExecutionTimeMs": 373,
      "stageExecutionUnit": 3,
      "stageExecWallTimeMs": 125,
      "operatorStats": {
        "AggregateOperator_1_0@pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local:8442": {
          "numBlocks": "2",
          "numRows": "1",
          "operatorExecutionTimeMs": "124",
          "operatorExecStartTimeMs": "1712834980239",
          "operatorId": "AggregateOperator_1_0@pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local:8442",
          "operatorExecEndTimeMs": "1712834980364"
        },
        "MailboxSendOperator_1_0@pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local:8442": {
          "numBlocks": "1",
          "numRows": "1",
          "operatorExecutionTimeMs": "125",
          "operatorExecStartTimeMs": "1712834980239",
          "operatorId": "MailboxSendOperator_1_0@pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local:8442",
          "operatorExecEndTimeMs": "1712834980364"
        },
        "MailboxReceiveOperator_1_0@pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local:8442": {
          "numBlocks": "3",
          "numRows": "1",
          "operatorExecutionTimeMs": "124",
          "operatorExecStartTimeMs": "1712834980239",
          "operatorId": "MailboxReceiveOperator_1_0@pinot-server-2.pinot-server-headless.pinot-dev.svc.cluster.local:8442",
          "operatorExecEndTimeMs": "1712834980364"
        }
      }
    },
    "2": {
      "numBlocks": 3,
      "numRows": 2,
      "stageExecutionTimeMs": 136,
      "stageExecutionUnit": 2,
      "stageExecWallTimeMs": 68,
      "numSegmentsQueried": 14145,
      "numSegmentsProcessed": 14145,
      "numSegmentsMatched": 14145,
      "numDocsScanned": 1206298,
      "totalDocs": 1206298,
      "traceInfo": {
        "pinot_metadata_feeds": "[{\"0\":[{\"SegmentPrunerService Time\":8},{\"CombinePlanNode Time\":5},{\"AggregationCombineOperator Time\":35},{\"StreamingInstanceResponseOperator Time\":35}]},{\"0_0\":[]},{\"0_1\":[]},{\"0_2\":[{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},..,{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0}]}]"
      },
      "operatorStats": {
        "MailboxSendOperator_2_0@pinot-server-1.pinot-server-headless.pinot-dev.svc.cluster.local:8442": {
          "numBlocks": "1",
          "numRows": "1",
          "operatorExecutionTimeMs": "68",
          "operatorExecStartTimeMs": "1712834980286",
          "operatorId": "MailboxSendOperator_2_0@pinot-server-1.pinot-server-headless.pinot-dev.svc.cluster.local:8442",
          "operatorExecEndTimeMs": "1712834980345",
          "table": "pinot_metadata_feeds"
        },
        "LeafStageTransferableBlockOperator_2_0@pinot-server-1.pinot-server-headless.pinot-dev.svc.cluster.local:8442": {
          "numConsumingSegmentsProcessed": "0",
          "numSegmentsPrunedByInvalid": "0",
          "numSegmentsPrunedByValue": "0",
          "numRows": "1",
          "numEntriesScannedPostFilter": "0",
          "numDocsScanned": "1206298",
          "numSegmentsMatched": "14145",
          "numSegmentsPrunedByLimit": "0",
          "timeUsedMs": "68",
          "operatorExecEndTimeMs": "1712834980354",
          "totalDocs": "1206298",
          "numConsumingSegmentsMatched": "0",
          "numSegmentsQueried": "14145",
          "numBlocks": "2",
          "traceInfo": "[{\"0\":[{\"SegmentPrunerService Time\":8},{\"CombinePlanNode Time\":5},{\"AggregationCombineOperator Time\":35},{\"StreamingInstanceResponseOperator Time\":35}]},{\"0_0\":[]},{\"0_1\":[]},{\"0_2\":[{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},..{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0}]}]",
          "operatorExecutionTimeMs": "68",
          "operatorExecStartTimeMs": "1712834980286",
          "numSegmentsPrunedByServer": "0",
          "numSegmentsProcessed": "14145",
          "operatorId": "LeafStageTransferableBlockOperator_2_0@pinot-server-1.pinot-server-headless.pinot-dev.svc.cluster.local:8442",
          "numEntriesScannedInFilter": "0",
          "table": "pinot_metadata_feeds"
        }
      },
      "tableNames": [
        "pinot_metadata_feeds"
      ]
    }
  },
  "exceptions": [],
  "numServersQueried": 0,
  "numServersResponded": 0,
  "numSegmentsQueried": 14145,
  "numSegmentsProcessed": 14145,
  "numSegmentsMatched": 14145,
  "numConsumingSegmentsQueried": 0,
  "numConsumingSegmentsProcessed": 0,
  "numConsumingSegmentsMatched": 0,
  "numDocsScanned": 1206298,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 0,
  "numGroupsLimitReached": false,
  "totalDocs": 1206298,
  "timeUsedMs": 212,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0,
  "numSegmentsPrunedByBroker": 0,
  "numSegmentsPrunedByServer": 0,
  "numSegmentsPrunedInvalid": 0,
  "numSegmentsPrunedByLimit": 0,
  "numSegmentsPrunedByValue": 0,
  "explainPlanNumEmptyFilterSegments": 0,
  "explainPlanNumMatchAllFilterSegments": 0,
  "brokerId": "Broker_pinot-broker-2.pinot-broker-headless.pinot-dev.svc.cluster.local_8099",
  "brokerReduceTimeMs": 86,
  "numRowsResultSet": 1
}
```
Full query response in the Pinot UI for a query that succeeded with multistage disabled, `SELECT COUNT(*) FROM pinot_metadata_feeds;`:

```json
{
  "resultTable": {
    "dataSchema": {
      "columnNames": [
        "count(*)"
      ],
      "columnDataTypes": [
        "LONG"
      ]
    },
    "rows": [
      [
        1206170
      ]
    ]
  },
  "requestId": "132994053000000022",
  "brokerId": "Broker_pinot-broker-2.pinot-broker-headless.pinot-dev.svc.cluster.local_8099",
  "exceptions": [],
  "numServersQueried": 1,
  "numServersResponded": 1,
  "numSegmentsQueried": 14144,
  "numSegmentsProcessed": 14144,
  "numSegmentsMatched": 14144,
  "numConsumingSegmentsQueried": 0,
  "numConsumingSegmentsProcessed": 0,
  "numConsumingSegmentsMatched": 0,
  "numDocsScanned": 1206170,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 0,
  "numGroupsLimitReached": false,
  "totalDocs": 1206170,
  "timeUsedMs": 189,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "brokerReduceTimeMs": 0,
  "segmentStatistics": [],
  "traceInfo": {
    "pinot-server-2_O": "[{\"0\":[{\"SegmentPrunerService Time\":6},{\"CombinePlanNode Time\":3},{\"AggregationCombineOperator Time\":49},{\"InstanceResponseOperator Time\":49}]},{\"0_0\":[]},{\"0_1\":[]},{\"0_2\":[{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},...{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0},{\"FastFilteredCountOperator Time\":0}]}]"
  },
  "minConsumingFreshnessTimeMs": 0,
  "numSegmentsPrunedByBroker": 0,
  "numSegmentsPrunedByServer": 0,
  "numSegmentsPrunedInvalid": 0,
  "numSegmentsPrunedByLimit": 0,
  "numSegmentsPrunedByValue": 0,
  "explainPlanNumEmptyFilterSegments": 0,
  "explainPlanNumMatchAllFilterSegments": 0,
  "numRowsResultSet": 1
}
```
Expectation
Since we have 3 replicas of every segment and 3 replicas of every component, Pinot should route queries only to healthy broker/server pods, and queries should not fail when one out of three server/broker pods is unavailable.
Solutions tried so far
- Query routing using adaptive server selection (https://docs.pinot.apache.org/operators/operating-pinot/tuning/query-routing-using-adaptive-server-selection): added the following to the broker conf:

  ```
  pinot.broker.adaptive.server.selector.enable.stats.collection=true
  pinot.broker.adaptive.server.selector.type=HYBRID
  ```
- Enabled replicaGroup using default tags:

  ```json
  {
    "tableName": "pinot_metadata_feeds",
    "tableType": "OFFLINE",
    "quota": {
      "maxQueriesPerSecond": 300,
      "storage": "140G"
    },
    "routing": {
      "segmentPrunerTypes": ["partition"],
      "instanceSelectorType": "replicaGroup"
    },
    "segmentsConfig": {
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "3650",
      "schemaName": "pinot_metadata_feeds",
      "replication": "3",
      "replicasPerPartition": "1",
      "segmentPushType": "APPEND",
      "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
    },
    "ingestionConfig": {},
    "task": {
      "taskTypeConfigsMap": {
        "MergeRollupTask": {
          "1day.mergeType": "concat",
          "1day.bucketTimePeriod": "1d",
          "1day.bufferTimePeriod": "1d"
        }
      }
    },
    "tenants": {},
    "tableIndexConfig": {
      "loadMode": "MMAP",
      "nullHandlingEnabled": "true"
    },
    "instanceAssignmentConfigMap": {
      "OFFLINE": {
        "tagPoolConfig": {
          "tag": "DefaultTenant_OFFLINE"
        },
        "replicaGroupPartitionConfig": {
          "replicaGroupBased": true,
          "numInstances": 3,
          "numReplicaGroups": 3,
          "numInstancesPerReplicaGroup": 1
        }
      }
    },
    "metadata": {
      "customConfigs": {}
    }
  }
  ```
- Upgrade from 1.0.0 to 1.1.0: will test in the evening.
As of now, Pinot is not highly available for us, despite following all of the replication best practices. This worries us, because we could face downtime any day Kubernetes randomly restarts a pod, which is not uncommon. The issue is quite important for us, as our base assumption was that Pinot is highly available. Any help is much appreciated!
Thanks!
FYI: This issue was first discussed on Slack here.
@piby180 We also have a similar configuration deployed in prod. It would be great if you could post your observations here. A couple of questions:
- Does the Pinot upgrade to 1.1.0 help mitigate any of the issues?
- How are you dealing with upgrades in prod currently? Does a major upgrade cause any compatibility issues?
Hi @vineethvp
1. The Pinot upgrade to 1.1.0 didn't help us.
2. We are still facing the issue, but we have reduced the downtime window by tuning the probes, especially the startup probes. I recently contributed a PR that allows adjusting the probe settings, which was merged into the Pinot Helm chart.

On Slack, I see quite a few users experiencing this issue.
The core issue is that a server/broker is marked ready by Pinot before it is marked Ready by Kubernetes, because its readiness probe hasn't succeeded yet. Because of this, Pinot starts sending traffic to the server/broker, but the server's DNS name does not resolve properly.
Pinot needs to wait until the server/broker DNS endpoint is resolvable again.
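For reference, the probe tuning mentioned above looks roughly like this in our chart values. This is a sketch: the exact key names under `probes` depend on the chart version you deploy (the startup-probe knobs in particular come from the recently merged PR), so verify them against your chart's values.yaml before using.

```yaml
server:
  probes:
    livenessEnabled: true
    readinessEnabled: true
    # Long startup window so the pod only turns Ready (and starts
    # receiving traffic) once segments are actually loaded.
    # Key names below are illustrative; check your chart version.
    initialDelaySeconds: 60
    periodSeconds: 10
    failureThreshold: 120
```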
cc @xiangfu0 @Jackie-Jiang
Can you describe the Pinot server StatefulSet and check the liveness probe? I have a feeling that the pod goes down faster than the server is marked down.
Ideally the Pinot server should be marked down first, so the broker stops routing requests to it, and only then should k8s shut it down.
We have terminationGracePeriodSeconds: 300, so k8s allows a grace period of 5 minutes for the server to shut down gracefully; only after 5 minutes will k8s kill the server pod. Our servers usually terminate gracefully within 1-2 minutes, so I think the server pods are shutting down gracefully.
I see this error after a server pod has finished restarting and shows healthy logs, but its status is still not Ready. Once the pod's status is Ready, the error disappears.
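One pattern that could enforce the shutdown ordering Jackie describes (instance marked down in Helix first, brokers drop it from routing, then the process terminates) is a preStop sleep on the server container. This is only a sketch, and assumes the chart or a post-render patch lets you set the lifecycle hook:

```yaml
containers:
  - name: server
    lifecycle:
      preStop:
        exec:
          # Delay SIGTERM so brokers can observe the instance going down
          # in Helix and stop routing queries to it first.
          command: ["sh", "-c", "sleep 30"]
```

Note this addresses the shutdown path only; it does not help with the startup-side DNS problem described above.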
Here is the YAML for our server StatefulSet:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: pinot
    meta.helm.sh/release-namespace: pinot-dev
  creationTimestamp: "2024-04-09T15:24:16Z"
  generation: 23
  labels:
    app: pinot
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 1.0.0
    component: server
    helm.sh/chart: pinot-0.2.9
    heritage: Helm
    release: pinot
  name: pinot-server
  namespace: pinot-dev
  resourceVersion: "1272018086"
  uid: 04b4409e-c877-4210-9083-97d4652f7174
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: pinot
      component: server
      release: pinot
  serviceName: pinot-server-headless
  template:
    metadata:
      annotations:
        checksum/config: 5cb2dcd44b42446bd08039c9485c290d7f2d4b3e2bd080090c5a19f6eda456a8
        kubectl.kubernetes.io/restartedAt: "2024-06-10T14:43:42Z"
        prometheus.io/port: "8008"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: pinot
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/version: 1.0.0
        component: server
        helm.sh/chart: pinot-0.2.9
        heritage: Helm
        release: pinot
    spec:
      affinity: {}
      containers:
      - args:
        - StartServer
        - -clusterName
        - pinot
        - -zkAddress
        - pinot-zookeeper:2181
        - -configFileName
        - /var/pinot/server/config/pinot-server.conf
        env:
        - name: JAVA_OPTS
          value: -XX:+ExitOnOutOfMemoryError -Xms1G -Xmx6G -Djute.maxbuffer=100000000
            -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log
            -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml
            -Dlog4j2.configurationFile=/opt/pinot/etc/conf/pinot-server-log4j2.xml
            -Dplugins.dir=/opt/pinot/plugins
        - name: LOG4J_CONSOLE_LEVEL
          value: error
        image: <registry>/pinot:release-1.1.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /health/liveness
            port: 8097
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: server
        ports:
        - containerPort: 8098
          name: netty
          protocol: TCP
        - containerPort: 8097
          name: admin
          protocol: TCP
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /health/readiness
            port: 8097
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "4"
            memory: 14Gi
          requests:
            cpu: "4"
            memory: 14Gi
        securityContext:
          runAsGroup: 3000
          runAsNonRoot: true
          runAsUser: 1000
        startupProbe:
          failureThreshold: 120
          httpGet:
            path: /health/liveness
            port: 8097
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/pinot/server/config
          name: config
        - mountPath: /var/pinot/server/data
          name: data
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: pinot
      nodeSelector:
        workload-type: pinot
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 3000
        fsGroupChangePolicy: Always
        runAsGroup: 3000
        runAsUser: 1000
      serviceAccount: pinot
      serviceAccountName: pinot
      terminationGracePeriodSeconds: 300
      volumes:
      - configMap:
          defaultMode: 420
          name: pinot-server-config
        name: config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1T
      storageClassName: ebs-csi-gp3-pinot
      volumeMode: Filesystem
```
@vineethvp @xiangfu0
Looks like I finally hit the jackpot on this issue after 2 months.
I have added the following to the pinot-server-headless and pinot-broker-headless Services, and the error seems to have disappeared:

```yaml
spec:
  publishNotReadyAddresses: true
```

Setting this to true means k8s will publish DNS records for the pods regardless of whether they are Ready. Whether or not to query a server now depends solely on Pinot.
From the Kubernetes docs:
> publishNotReadyAddresses indicates that any agent which deals with endpoints for this Service should disregard any indications of ready/not-ready. The primary use case for setting this field is for a StatefulSet's Headless Service to propagate SRV DNS records for its Pods for the purpose of peer discovery. The Kubernetes controllers that generate Endpoints and EndpointSlice resources for Services interpret this to mean that all endpoints are considered "ready" even if the Pods themselves are not. Agents which consume only Kubernetes generated endpoints through the Endpoints or EndpointSlice resources can safely assume this behavior.
https://kubernetes.io/docs/reference/kubernetes-api/service-resources/service-v1/#ServiceSpec
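Since the Helm chart renders the headless Services itself, one way to apply the field without forking the chart is a merge patch against the live Services. A sketch (the namespace and Service names match our setup; adjust for yours):

```yaml
# Save this fragment as patch.yaml, then apply it to both headless Services:
#   kubectl patch service pinot-server-headless -n pinot-dev --type merge --patch-file patch.yaml
#   kubectl patch service pinot-broker-headless -n pinot-dev --type merge --patch-file patch.yaml
spec:
  publishNotReadyAddresses: true
```

Note that a `helm upgrade` may revert the patch if the chart templates don't carry the field, so it is worth checking whether your chart version exposes it in values.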
@vineethvp Could you test on your end whether adding publishNotReadyAddresses: true to pinot-broker-headless and pinot-server-headless fixes the issue?
cc @zhtaoxiang ^^