
ZooKeeper Contention During Batch Ingestion

Open · qswawrq opened this issue 4 months ago · 1 comment

ZooKeeper Contention and Linear Performance Degradation with High-Parallelism Batch Ingestion

Environment

  • Pinot Version: apachepinot/pinot:latest
  • Deployment: Kubernetes
  • Cluster Size: 1 Controller, 1 ZooKeeper, 3 Brokers, 3 Servers
  • ZooKeeper Config:
    • ZOO_SNAPCOUNT=100000
    • ZOO_AUTOPURGE_INTERVAL=1
    • ZOO_AUTOPURGE_RETAIN_COUNT=5
    • Heap: 1GB
    • Storage: 20GB used

Problem Description

We're experiencing severe ZooKeeper contention and linear performance degradation during batch ingestion of 50,000 Parquet files (each around 350 MB, roughly 17.5 TB in total) using 100 parallel Kubernetes Job workers with jobType: SegmentCreationAndMetadataPush.

Performance Degradation Timeline

Time Elapsed   Files/Worker   Total Segments   Time per File   Degradation Factor
Initial        0-5            500              2 minutes       1x (baseline)
+3 hours       8-9            1,002            20 minutes      10x
+30 hours      43-44          4,438            88 minutes      44x
+50 hours      73-76          6,150            175 minutes     87x

Performance continues to degrade linearly as segment count grows.

Critical Finding: We tested with 5, 20, and 100 workers. The linear degradation occurs at every parallelism level, which indicates the bottleneck is not just concurrency but the O(n) cost of reading/writing the growing segment list. The ingestion workers spend only a little time building segments locally and roughly 99% of their time trying to push metadata.
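
(A back-of-envelope consequence, under our assumption that each push costs time proportional to the current segment count k: ingesting n segments costs roughly c·(1 + 2 + … + n) = c·n(n+1)/2, i.e. total wall-clock time grows quadratically in segment count even though per-file time only grows linearly.)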

Observed Behavior

1. ZooKeeper Metadata Update Failures (17% Error Rate)

Segment uploads frequently fail with optimistic locking errors. In the last 10,000 log lines, we observed 111 ZK version conflicts out of 661 upload attempts (16.8% failure rate):

2025/11/09 15:47:07.055 ERROR [ZKOperator] [jersey-server-managed-async-executor-128] 
Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, 
table: post_metrics_OFFLINE, expected version: 3428

2025/11/09 15:47:07.055 ERROR [PinotSegmentUploadDownloadRestletResource] [jersey-server-managed-async-executor-128] 
Exception while uploading segment: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, 
table: post_metrics_OFFLINE, expected version: 3428
java.lang.RuntimeException: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, 
table: post_metrics_OFFLINE, expected version: 3428
	at org.apache.pinot.controller.api.upload.ZKOperator.processExistingSegment(ZKOperator.java:341)
	at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:120)
	at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:433)
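
For context, the "expected version" failures are ZooKeeper's compare-and-set semantics at work: a writer reads a znode together with its version, modifies the data, and writes it back passing the version it read; ZK rejects the write if anyone else wrote in between. A minimal Java sketch of that pattern (an illustration of the mechanism only, not Pinot's actual code path, which goes through Helix):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CasUpdate {
  // Read-modify-write guarded by ZK's znode version; retried on conflict.
  static void updateWithRetry(ZooKeeper zk, String path, byte[] newData)
      throws Exception {
    while (true) {
      Stat stat = new Stat();
      zk.getData(path, false, stat);                  // read data + current version
      try {
        zk.setData(path, newData, stat.getVersion()); // CAS: fails if version moved
        return;
      } catch (KeeperException.BadVersionException e) {
        // Another worker updated the znode first; loop around, re-read, retry.
        // With 100 concurrent pushers touching the same table metadata,
        // collisions (and the retries and extra ZK traffic they cause) are expected.
      }
    }
  }
}

Each failed attempt still pays for a full read and write of the (growing) metadata, so the 16.8% conflict rate compounds the O(n) cost described above.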

2. Large ZooKeeper Transaction Logs

ZooKeeper transaction logs have grown to multi-GB sizes:

$ kubectl exec pinot-zookeeper-0 -- ls -lh /bitnami/zookeeper/data/version-2/

-rw-r--r-- 1 1001 1001 129M Nov 10 16:42 log.1c4167f
-rw-r--r-- 1 1001 1001  65M Nov 10 17:00 log.1c58844
-rw-r--r-- 1 1001 1001 1.0G Nov 10 17:28 log.1c78fc6
-rw-r--r-- 1 1001 1001 1.9G Nov 10 17:44 log.1c86b58
-rw-r--r-- 1 1001 1001 2.2G Nov 10 18:32 log.1cc3e55  ← Largest log

Total ZK data directory: 5.4GB

Despite autopurge being enabled (ZOO_AUTOPURGE_INTERVAL=1, ZOO_SNAPCOUNT=100000), individual transaction logs grow to 2.2GB before a snapshot is taken. We are not sure whether this contributes to the performance degradation.
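
For reference, these are the standard zoo.cfg properties behind the env vars above (a sketch; values are illustrative, and we assume the Bitnami ZOO_* variables map onto them one-to-one):

# zoo.cfg
snapCount=10000                # transactions between snapshots (and log rolls);
                               # with large segment-list znodes, the default 100000
                               # can mean multi-GB logs before a roll happens
autopurge.snapRetainCount=5    # snapshots (and matching logs) to keep
autopurge.purgeInterval=1      # purge-task interval in hours (0 disables it)

Old snapshots and transaction logs can also be purged manually with the script bundled with ZooKeeper:

bin/zkCleanup.sh -n 5          # keep only the 5 most recent snapshots + logs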

Configuration

Ingestion Job Spec:

jobType: SegmentCreationAndMetadataPush

pushJobSpec:
  pushAttempts: 10
  pushRetryIntervalMillis: 2000
  pushFileNamePattern: 'glob:**post_metrics_OFFLINE_*_*_$FILE_PADDED_*.tar.gz'
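
One knob that may be worth noting here (a sketch; pushParallelism is a PushJobSpec field as we understand it, and the values are illustrative): spacing the retries further apart, so that retries from 100 workers do not keep landing in the same compare-and-set window on the controller:

pushJobSpec:
  pushAttempts: 10
  pushRetryIntervalMillis: 30000   # longer (ideally jittered) retry gap
  pushParallelism: 1               # serialize pushes within each worker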

Kubernetes Job:

  • completions: 100
  • parallelism: 100
  • completionMode: Indexed

Each worker processes 500 files sequentially, creating one segment per file (100 workers × 500 files = 50,000 segments in total); a YAML sketch of the Job follows below.
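
For completeness, the shape of that Job as YAML (a sketch; the metadata name and container wiring are hypothetical, while completions/parallelism/completionMode are the fields listed above):

apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-batch-ingest          # hypothetical name
spec:
  completions: 100
  parallelism: 100
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ingest
          image: apachepinot/pinot:latest
          # each indexed worker selects its 500-file slice via the
          # JOB_COMPLETION_INDEX env var that Indexed Jobs inject
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/spec/job-spec.yaml"]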

Questions

  1. Is this a known limitation of single-ZK deployments with high-parallelism ingestion?

  2. What is the recommended parallelism architecture for metadata push operations to avoid ZK contention?

Thank you!

qswawrq · Nov 10 '25 19:11

In operating our own clusters, we have never run into similar problems. The ZK log size is definitely higher than expected given the segment count (we have large tables with more than 500K segments). Could you please try a larger ZK instance, or double-check the ZK settings?

cc @xiangfu0 @KKcorps for extra input
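
(For reference, one way to dump the live settings of a running ZK node, assuming nc is available in the container; on ZooKeeper 3.5+ the conf four-letter-word command must be whitelisted via 4lw.commands.whitelist in zoo.cfg:)

kubectl exec pinot-zookeeper-0 -- bash -c 'echo conf | nc localhost 2181'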

Jackie-Jiang · Nov 18 '25 00:11