dang-stripe
Thanks for following up - we already fixed the segments by deleting them, so we can't easily reproduce atm. I looked at the code and it seems like we'd hit this...
@Jackie-Jiang Is it possible to make the metadata push job or API call to block until the segment has successfully been added to idealstate to avoid this issue? I think...
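The blocking behavior asked about above could look something like the following sketch. This is not Pinot's actual API; `get_ideal_state_segments` is a hypothetical callable standing in for whatever controller call returns the table's ideal-state segment names, and the timeouts are made up:

```python
import time

def wait_for_ideal_state(get_ideal_state_segments, segment, timeout_s=300, poll_s=5):
    """Poll until `segment` appears in the table's ideal state, or time out.

    `get_ideal_state_segments` is an assumed callable returning the set of
    segment names currently in the ideal state (not a real Pinot client call).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if segment in get_ideal_state_segments():
            return True  # segment is now in the ideal state; safe to report success
        time.sleep(poll_s)
    return False  # push job should fail loudly here instead of silently succeeding
```

The idea is that the push job (or upload API) would only report success after this returns `True`, rather than as soon as the upload request is accepted.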
We're using this class: `org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner`. Based on the code, it seems like the job would correctly fail when a push fails. We no longer have the job logs around from...
This issue happened again, but we were able to get more logs this time. It seems like:
1. Job sends segment upload request to the controller
2. Controller processes the upload, adds segment...
Also, since the push job will have received a 500, when it retries the request it may get a 200 because the upload request finds the segments already exist, causing the...
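The retry hazard described above can be illustrated with a toy simulation. This is not Pinot code; `FakeController` is a hypothetical stand-in whose first upload persists the segment but fails before updating the ideal state, so the retry sees "already exists" and answers 200:

```python
class FakeController:
    """Hypothetical controller: the first upload stores the segment but crashes
    before updating the ideal state (500); the retry finds the segment already
    present and short-circuits with 200, never repairing the ideal state."""
    def __init__(self):
        self.segments = set()      # segments persisted on the controller
        self.ideal_state = set()   # segments registered in the ideal state

    def upload(self, segment):
        if segment in self.segments:
            return 200  # "already exists" - masks the earlier partial failure
        self.segments.add(segment)
        return 500      # crashed before adding the segment to the ideal state

def push_with_retries(controller, segment, max_attempts=3):
    """Naive retry-on-500 loop, as a push client might implement it."""
    for _ in range(max_attempts):
        if controller.upload(segment) == 200:
            return True  # job reports success
    return False

controller = FakeController()
push_with_retries(controller, "seg_0")  # returns True: the job "succeeds"
# ...yet controller.ideal_state is still empty, so queries never see seg_0
```

This is why an "already exists" 200 on retry is dangerous unless the controller also verifies the segment made it into the ideal state.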
Instance partitions with `numPartitions=0` and `numInstancesPerPartition=0`:
```json
"partitionToInstancesMap": {
  "0_0": [
    "testinstance-uswest2b-4",
    "testinstance-uswest2b-5",
    "testinstance-uswest2b-6",
    "testinstance-uswest2b-1",
    "testinstance-uswest2b-2",
    "testinstance-uswest2b-3",
    // 7 was added when rebalancing for the scale up
    "testinstance-uswest2b-7"
  ],
  "0_1": ...
```
@priyen-stripe could you include a sample query that was seeing this behavior?
@klsince We do not see that error in logs. After looking deeper, I do see that the Helix ZK client is struggling to maintain a persistent connection with ZK. ```...
@klsince FYI I paired with Jackie and we narrowed it down to high GC on the server causing the ZK disconnects. I filed https://github.com/apache/pinot/issues/14301 as a follow-up. Going to...
We were able to repro this for MSE by doing a `kill -9` on the process. Once the server comes up, we see queries fail for 2-3 minutes. We did...