
ZooKeeper Contention During Batch Ingestion

Open · qswawrq opened this issue 4 months ago · 1 comment

ZooKeeper Contention and Linear Performance Degradation with High-Parallelism Batch Ingestion

Environment

  • Pinot Version: apachepinot/pinot:latest
  • Deployment: Kubernetes
  • Cluster Size: 1 Controller, 1 ZooKeeper, 3 Brokers, 3 Servers
  • ZooKeeper Config:
    • ZOO_SNAPCOUNT=100000
    • ZOO_AUTOPURGE_INTERVAL=1
    • ZOO_AUTOPURGE_RETAIN_COUNT=5
    • Heap: 1GB
    • Storage: 20GB used

Problem Description

We're experiencing severe ZooKeeper contention and linear performance degradation during batch ingestion of 50,000 Parquet files (each around 350 MB, roughly 17.5 TB in total) using 100 parallel Kubernetes Job workers with jobType: SegmentCreationAndMetadataPush.

Performance Degradation Timeline

Time Elapsed   Files/Worker   Total Segments   Time per File   Degradation Factor
Initial        0-5            500              2 minutes       1x (baseline)
+3 hours       8-9            1,002            20 minutes      10x
+30 hours      43-44          4,438            88 minutes      44x
+50 hours      73-76          6,150            175 minutes     87x

Performance continues to degrade linearly as segment count grows.

Critical Finding: We tested with 5, 20, and 100 workers. The linear degradation occurs at every parallelism level, which indicates the bottleneck is not just concurrency but the O(n) cost of reading/writing the growing segment list. The ingestion workers spend only a little time building segments locally and roughly 99% of their time trying to push metadata.
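
(A back-of-envelope consequence, under our assumption that each push costs time proportional to the current segment count k: ingesting n segments costs roughly c·(1 + 2 + … + n) = c·n(n+1)/2, i.e. total wall-clock time grows quadratically in segment count even though per-file time only grows linearly.)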

Observed Behavior

1. ZooKeeper Metadata Update Failures (17% Error Rate)

Segment uploads frequently fail with optimistic locking errors. In the last 10,000 log lines, we observed 111 ZK version conflicts out of 661 upload attempts (16.8% failure rate):

2025/11/09 15:47:07.055 ERROR [ZKOperator] [jersey-server-managed-async-executor-128] 
Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, 
table: post_metrics_OFFLINE, expected version: 3428

2025/11/09 15:47:07.055 ERROR [PinotSegmentUploadDownloadRestletResource] [jersey-server-managed-async-executor-128] 
Exception while uploading segment: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, 
table: post_metrics_OFFLINE, expected version: 3428
java.lang.RuntimeException: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, 
table: post_metrics_OFFLINE, expected version: 3428
	at org.apache.pinot.controller.api.upload.ZKOperator.processExistingSegment(ZKOperator.java:341)
	at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:120)
	at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:433)
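
For context, the "expected version" failures are ZooKeeper's compare-and-set semantics at work: a writer reads a znode together with its version, modifies the data, and writes it back passing the version it read; ZK rejects the write if anyone else wrote in between. A minimal Java sketch of that pattern (an illustration of the mechanism only, not Pinot's actual code path, which goes through Helix):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CasUpdate {
  // Read-modify-write guarded by ZK's znode version; retried on conflict.
  static void updateWithRetry(ZooKeeper zk, String path, byte[] newData)
      throws Exception {
    while (true) {
      Stat stat = new Stat();
      zk.getData(path, false, stat);                  // read data + current version
      try {
        zk.setData(path, newData, stat.getVersion()); // CAS: fails if version moved
        return;
      } catch (KeeperException.BadVersionException e) {
        // Another worker updated the znode first; loop around, re-read, retry.
        // With 100 concurrent pushers touching the same table metadata,
        // collisions (and the retries and extra ZK traffic they cause) are expected.
      }
    }
  }
}

Each failed attempt still pays for a full read and write of the (growing) metadata, so the 16.8% conflict rate compounds the O(n) cost described above.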

2. Large ZooKeeper Transaction Logs

ZooKeeper transaction logs have grown to multi-GB sizes:

$ kubectl exec pinot-zookeeper-0 -- ls -lh /bitnami/zookeeper/data/version-2/

-rw-r--r-- 1 1001 1001 129M Nov 10 16:42 log.1c4167f
-rw-r--r-- 1 1001 1001  65M Nov 10 17:00 log.1c58844
-rw-r--r-- 1 1001 1001 1.0G Nov 10 17:28 log.1c78fc6
-rw-r--r-- 1 1001 1001 1.9G Nov 10 17:44 log.1c86b58
-rw-r--r-- 1 1001 1001 2.2G Nov 10 18:32 log.1cc3e55  ← Largest log

Total ZK data directory: 5.4GB

Despite autopurge being enabled (ZOO_AUTOPURGE_INTERVAL=1, ZOO_SNAPCOUNT=100000), individual transaction logs grow to 2.2GB before a snapshot is taken. We are not sure whether this contributes to the performance degradation.
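
For reference, these are the standard zoo.cfg properties behind the env vars above (a sketch; values are illustrative, and we assume the Bitnami ZOO_* variables map onto them one-to-one):

# zoo.cfg
snapCount=10000                # transactions between snapshots (and log rolls);
                               # with large segment-list znodes, the default 100000
                               # can mean multi-GB logs before a roll happens
autopurge.snapRetainCount=5    # snapshots (and matching logs) to keep
autopurge.purgeInterval=1      # purge-task interval in hours (0 disables it)

Old snapshots and transaction logs can also be purged manually with the script bundled with ZooKeeper:

bin/zkCleanup.sh -n 5          # keep only the 5 most recent snapshots + logs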

Configuration

Ingestion Job Spec:

jobType: SegmentCreationAndMetadataPush

pushJobSpec:
  pushAttempts: 10
  pushRetryIntervalMillis: 2000
  pushFileNamePattern: 'glob:**post_metrics_OFFLINE_*_*_$FILE_PADDED_*.tar.gz'
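
One knob that may be worth noting here (a sketch; pushParallelism is a PushJobSpec field as we understand it, and the values are illustrative): spacing the retries further apart, so that retries from 100 workers do not keep landing in the same compare-and-set window on the controller:

pushJobSpec:
  pushAttempts: 10
  pushRetryIntervalMillis: 30000   # longer (ideally jittered) retry gap
  pushParallelism: 1               # serialize pushes within each worker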

Kubernetes Job:

  • completions: 100
  • parallelism: 100
  • completionMode: Indexed

Each worker processes 500 files sequentially, creating one segment per file (100 workers × 500 files = 50,000 segments in total); a YAML sketch of the Job follows below.
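
For completeness, the shape of that Job as YAML (a sketch; the metadata name and container wiring are hypothetical, while completions/parallelism/completionMode are the fields listed above):

apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-batch-ingest          # hypothetical name
spec:
  completions: 100
  parallelism: 100
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ingest
          image: apachepinot/pinot:latest
          # each indexed worker selects its 500-file slice via the
          # JOB_COMPLETION_INDEX env var that Indexed Jobs inject
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/spec/job-spec.yaml"]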

Questions

  1. Is this a known limitation of single-ZK deployments with high-parallelism ingestion?

  2. What is the recommended parallelism architecture for metadata push operations to avoid ZK contention?

Thank you!

qswawrq · Nov 10 '25 19:11

In operating our own clusters, we have never run into similar problems. The ZK log size is definitely higher than expected given the segment count (we have large tables with more than 500K segments). Could you please try a larger ZK instance, or double-check the ZK settings?

cc @xiangfu0 @KKcorps for extra input
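
(For reference, one way to dump the live settings of a running ZK node, assuming nc is available in the container; on ZooKeeper 3.5+ the conf four-letter-word command must be whitelisted via 4lw.commands.whitelist in zoo.cfg:)

kubectl exec pinot-zookeeper-0 -- bash -c 'echo conf | nc localhost 2181'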

Jackie-Jiang · Nov 18 '25 00:11