dang-stripe

Results: 12 comments of dang-stripe

Thanks for following up - we already fixed the segments by deleting them, so we can't easily reproduce atm. I looked at the code and it seems like we'd hit this...

@Jackie-Jiang Is it possible to make the metadata push job or API call block until the segment has been successfully added to the ideal state, to avoid this issue? I think...
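The blocking behavior suggested above could be sketched roughly like this: after the push, poll until the segment shows up in the ideal state, and fail the job on timeout. This is a minimal sketch; the check itself is injected as a `BooleanSupplier`, since the real check would query the controller. None of the names here are actual Pinot APIs.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: block a metadata push until the segment is visible
// in the ideal state, instead of returning as soon as the upload succeeds.
public final class IdealStateWaiter {
  // Polls the injected check until it reports the segment is in the ideal
  // state, or until the timeout elapses. Returns true on success.
  public static boolean waitForSegment(BooleanSupplier inIdealState,
                                       Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (inIdealState.getAsBoolean()) {
        return true;  // segment is visible in the ideal state; push is durable
      }
      if (System.nanoTime() >= deadline) {
        return false;  // caller should fail the push job rather than report success
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

A push runner using this would treat a `false` return as a push failure, which avoids the window where the job reports success before the ideal state is updated.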

We're using this class: `org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner`. Based on the code, it seems like the job would correctly fail when a push fails. We no longer have the job logs around from...

This issue happened again, but we were able to get more logs this time. It seems like: 1. The job sends a segment upload request to the controller. 2. The controller processes the upload, adds the segment...

Also, when the push job retries the request (since it received a 500), it may get a 200 because the upload request finds the segments already exist, causing the...
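The retry hazard described above can be shown with a toy simulation: the first upload persists the segment but fails on a later step and returns 500; the naive retry then hits the "already exists" path and gets a 200 that masks the original failure. Everything here (`FakeController`, `pushWithRetry`) is illustrative, not Pinot code.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the failure mode: a retry that treats "segment already
// exists" as success can hide a partially-failed first upload.
public final class RetryHazardDemo {
  static final class FakeController {
    final Set<String> stored = new HashSet<>();
    boolean failAfterStore = true;  // simulate a failure after the segment is persisted

    int upload(String segment) {
      boolean alreadyExists = !stored.add(segment);
      if (alreadyExists) {
        return 200;  // dedup path: reports success without redoing the failed step
      }
      if (failAfterStore) {
        failAfterStore = false;
        return 500;  // segment was stored, but e.g. the ideal-state update failed
      }
      return 200;
    }
  }

  // Naive client-side retry: one retry on a 5xx response.
  public static int pushWithRetry(FakeController controller, String segment) {
    int code = controller.upload(segment);
    if (code >= 500) {
      code = controller.upload(segment);  // retry hits the "already exists" path
    }
    return code;  // 200, even though the post-store step never succeeded
  }
}
```

The fix would be for the dedup path to verify the earlier attempt actually completed (e.g. that the segment made it into the ideal state) before returning 200.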

Instance partitions with `numPartitions=0` and `numInstancesPerPartition=0`:

```json
"partitionToInstancesMap": {
  "0_0": [
    "testinstance-uswest2b-4",
    "testinstance-uswest2b-5",
    "testinstance-uswest2b-6",
    "testinstance-uswest2b-1",
    "testinstance-uswest2b-2",
    "testinstance-uswest2b-3",
    // 7 was added when rebalancing for the scale up
    "testinstance-uswest2b-7"
  ],
  "0_1": ...
```
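For context, the `"0_0"` keys in `partitionToInstancesMap` read as `partitionId_replicaGroupId`, so `"0_1"` is partition 0, replica group 1. A minimal sketch of parsing that key format (the class and method names are mine, not Pinot's):

```java
// Illustrative parser for instance-partition map keys of the form
// "{partitionId}_{replicaGroupId}", e.g. "0_0" or "0_1".
public final class PartitionKey {
  public final int partitionId;
  public final int replicaGroupId;

  PartitionKey(int partitionId, int replicaGroupId) {
    this.partitionId = partitionId;
    this.replicaGroupId = replicaGroupId;
  }

  public static PartitionKey parse(String key) {
    String[] parts = key.split("_");
    return new PartitionKey(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
  }
}
```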

@priyen-stripe could you include a sample query that exhibited this behavior?

@klsince We do not see that error in logs. After looking deeper, I do see that the Helix ZK client is struggling to maintain a persistent connection with ZK. ```...

@klsince FYI I paired with Jackie and we narrowed it down to high GC on the server causing the ZK disconnects. I filed https://github.com/apache/pinot/issues/14301 as a follow-up. Going to...

We were able to repro this for MSE by doing a `kill -9` on the process. Once the server comes back up, we see queries fail for 2-3 minutes. We did...