ZooKeeper Contention and Linear Performance Degradation with High-Parallelism Batch Ingestion
Environment
- Pinot Version: apachepinot/pinot:latest
- Deployment: Kubernetes
- Cluster Size: 1 Controller, 1 ZooKeeper, 3 Brokers, 3 Servers
- ZooKeeper Config:
  - ZOO_SNAPCOUNT=100000
  - ZOO_AUTOPURGE_INTERVAL=1
  - ZOO_AUTOPURGE_RETAIN_COUNT=5
  - Heap: 1GB
- Storage: 20GB used
Problem Description
We're experiencing severe ZooKeeper contention and linear performance degradation during batch ingestion of 50,000 Parquet files (each around 350 MB) using 100 parallel Kubernetes Job workers with jobType: SegmentCreationAndMetadataPush.
Performance Degradation Timeline
| Time Elapsed | Files/Worker | Total Segments | Time per File | Degradation Factor |
|---|---|---|---|---|
| Initial | 0-5 | 500 | 2 minutes | 1x (baseline) |
| +3 hours | 8-9 | 1,002 | 20 minutes | 10x |
| +30 hours | 43-44 | 4,438 | 88 minutes | 44x |
| +50 hours | 73-76 | 6,150 | 175 minutes | 87x |
Performance continues to degrade linearly as segment count grows.
Critical Finding: We tested with 5, 20, and 100 workers. Linear degradation occurs at every parallelism level, indicating the bottleneck is not just concurrency but the O(n) cost of reading and rewriting the growing segment list. The workers spend relatively little time building segments locally and roughly 99% of their time trying to push metadata.
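To make the O(n) hypothesis concrete, here is a toy cost model (our assumption about the access pattern, not measured Pinot internals): if every metadata push must read and rewrite a record whose size tracks the current segment count, per-push latency grows linearly with that count.

```python
# Toy model (assumption, not measured Pinot internals): each push reads and
# rewrites a metadata record proportional in size to the current segment
# count, so the cost of a push grows linearly as segments accumulate.

def per_push_cost(n_segments, unit_cost=1.0):
    """Cost of one push when the push is O(current segment count)."""
    return unit_cost * n_segments

# Size growth alone predicts ~12x slowdown going from 500 to 6,150 segments;
# the observed 87x suggests version-conflict retries amplify it further.
ratio = per_push_cost(6150) / per_push_cost(500)
print(round(ratio, 1))  # 12.3
```

The gap between the ~12x predicted by record growth alone and the observed 87x is consistent with the retry storms described in the next section compounding the base O(n) cost.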
Observed Behavior
1. ZooKeeper Metadata Update Failures (17% Error Rate)
Segment uploads frequently fail with optimistic locking errors. In the last 10,000 log lines, we observed 111 ZK version conflicts out of 661 upload attempts (16.8% failure rate):
2025/11/09 15:47:07.055 ERROR [ZKOperator] [jersey-server-managed-async-executor-128]
Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0,
table: post_metrics_OFFLINE, expected version: 3428
2025/11/09 15:47:07.055 ERROR [PinotSegmentUploadDownloadRestletResource] [jersey-server-managed-async-executor-128]
Exception while uploading segment: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0,
table: post_metrics_OFFLINE, expected version: 3428
java.lang.RuntimeException: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0,
table: post_metrics_OFFLINE, expected version: 3428
at org.apache.pinot.controller.api.upload.ZKOperator.processExistingSegment(ZKOperator.java:341)
at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:120)
at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:433)
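The "expected version" error is ZooKeeper's optimistic compare-and-set rejecting a stale write. A minimal in-process sketch of the pattern (a toy stand-in we wrote for illustration, not Pinot's or ZooKeeper's actual code) shows why many writers mutating one versioned record produce exactly this failure mode:

```python
import threading

class FakeZnode:
    """Toy versioned record mimicking ZooKeeper's compare-and-set on setData."""
    def __init__(self, data=b""):
        self.data, self.version = data, 0
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self.data, self.version

    def set(self, data, expected_version):
        with self._lock:
            if expected_version != self.version:
                # Same class of failure as "expected version: 3428" above
                raise RuntimeError(f"expected version: {expected_version}, "
                                   f"actual: {self.version}")
            self.data, self.version = data, self.version + 1

def cas_update(znode, mutate, max_attempts=25):
    """Read-modify-write retry loop, like the controller's ZK metadata update."""
    for _ in range(max_attempts):
        data, version = znode.get()
        try:
            znode.set(mutate(data), expected_version=version)
            return
        except RuntimeError:
            continue  # lost the race to another writer; re-read and retry

znode = FakeZnode(b"segments")
workers = [threading.Thread(target=cas_update, args=(znode, lambda d: d + b"+1"))
           for _ in range(20)]
for w in workers: w.start()
for w in workers: w.join()
print(znode.version)  # 20 -- every writer eventually succeeds, some after retries
```

Every conflicting writer eventually gets through, but each conflict costs a full re-read of the record, which is what couples the 17% conflict rate to the O(n) record size.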
2. Large ZooKeeper Transaction Logs
ZooKeeper transaction logs have grown to multi-GB sizes:
$ kubectl exec pinot-zookeeper-0 -- ls -lh /bitnami/zookeeper/data/version-2/
-rw-r--r-- 1 1001 1001 129M Nov 10 16:42 log.1c4167f
-rw-r--r-- 1 1001 1001 65M Nov 10 17:00 log.1c58844
-rw-r--r-- 1 1001 1001 1.0G Nov 10 17:28 log.1c78fc6
-rw-r--r-- 1 1001 1001 1.9G Nov 10 17:44 log.1c86b58
-rw-r--r-- 1 1001 1001 2.2G Nov 10 18:32 log.1cc3e55 ← Largest log
Total ZK data directory: 5.4GB
Despite autopurge being enabled (ZOO_AUTOPURGE_INTERVAL=1, ZOO_SNAPCOUNT=100000), individual transaction logs grow to 2.2GB before a snapshot rolls them. We are not sure whether this is related to the performance degradation.
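One note on the mechanics: ZOO_SNAPCOUNT counts transactions, not bytes, so a log can reach gigabytes before it rolls when each transaction carries large segment metadata. If disk pressure becomes urgent, ZooKeeper ships a manual purge script; the path below assumes the Bitnami image layout, so adjust for your deployment:

```shell
# zkCleanup.sh deletes old snapshots/transaction logs, keeping the newest N.
# It only removes files already superseded by a snapshot -- the in-flight
# 2.2GB log will not shrink until ZK rolls it after ~ZOO_SNAPCOUNT txns.
kubectl exec pinot-zookeeper-0 -- \
  /opt/bitnami/zookeeper/bin/zkCleanup.sh -n 5
```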
Configuration
Ingestion Job Spec:
jobType: SegmentCreationAndMetadataPush
pushJobSpec:
  pushAttempts: 10
  pushRetryIntervalMillis: 2000
  pushFileNamePattern: 'glob:**post_metrics_OFFLINE_*_*_$FILE_PADDED_*.tar.gz'
Kubernetes Job:
- completions: 100
- parallelism: 100
- completionMode: Indexed
Each worker processes 500 files sequentially, creating one segment per file.
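One thing we noticed about this configuration: with pushAttempts: 10 and a fixed pushRetryIntervalMillis: 2000, workers that collide retry almost in lockstep. A client-side sketch of exponential backoff with full jitter (a workaround idea of ours, not an existing Pinot option; it would wrap whatever push call the worker makes):

```python
import random

def retry_delays(attempts=10, base_s=2.0, cap_s=120.0, seed=None):
    """Full-jitter backoff: delay_k ~ Uniform(0, min(cap_s, base_s * 2**k)).

    Unlike a fixed 2-second interval shared by 100 workers, jittered delays
    spread retries out so conflicting metadata pushes stop re-colliding.
    """
    rng = random.Random(seed)
    return [rng.uniform(0.0, min(cap_s, base_s * 2 ** k))
            for k in range(attempts)]

for i, delay in enumerate(retry_delays(seed=7), start=1):
    print(f"attempt {i}: sleep {delay:.1f}s")
```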
Questions
- Is this a known limitation of single-ZooKeeper deployments with high-parallelism ingestion?
- What is the recommended parallelism architecture for metadata push operations to avoid ZK contention?
Thank you!
During our operation of the cluster, we have never run into similar problems. The ZK log size is definitely higher than expected given the segment count (we operate large tables with more than 500K segments). Could you please try a larger ZooKeeper instance, or double-check the ZK settings?
cc @xiangfu0 @KKcorps for extra input