
[PDS-111356] Possible Race Condition in Metric Deregistration

Open jeremyk-91 opened this issue 5 years ago • 3 comments

See internal issue PDS-111356 for context.

An internal product that utilises multiple TransactionManager instances reported failures talking to two Cassandra nodes. On further investigation, these failures turned out to be generated when polling for metrics, specifically the Targeted Sweep metrics.

java.lang.IllegalStateException: {throwable0_message}
	at org.apache.commons.pool2.impl.BaseGenericObjectPool.assertOpen(BaseGenericObjectPool.java:672)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:412)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolingContainer.runWithGoodResource(CassandraClientPoolingContainer.java:142)
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolingContainer.runWithPooledResource(CassandraClientPoolingContainer.java:109)
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolImpl.runWithPooledResourceRecordingMetrics(CassandraClientPoolImpl.java:457)
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolImpl.runWithRetryOnHost(CassandraClientPoolImpl.java:404)
	at com.palantir.atlasdb.keyvalue.cassandra.paging.RowGetter.getRows(RowGetter.java:56)
	at com.palantir.atlasdb.keyvalue.cassandra.paging.CassandraRangePagingIterable.getRows(CassandraRangePagingIterable.java:94)
	at com.palantir.atlasdb.keyvalue.cassandra.paging.CassandraRangePagingIterable.getSinglePage(CassandraRangePagingIterable.java:81)
	at com.palantir.atlasdb.keyvalue.cassandra.paging.CassandraRangePagingIterable.getFirstPage(CassandraRangePagingIterable.java:70)
	at com.palantir.atlasdb.keyvalue.cassandra.paging.CassandraRangePagingIterable.getFirstPage(CassandraRangePagingIterable.java:37)
	at com.palantir.util.paging.AbstractPagingIterable$PagingIterator.computeNext(AbstractPagingIterable.java:60)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.palantir.common.base.ClosableIterators$1.hasNext(ClosableIterators.java:32)
	at com.palantir.atlasdb.cleaner.KeyValueServicePuncherStore.get(KeyValueServicePuncherStore.java:112)
	at com.palantir.atlasdb.cleaner.KeyValueServicePuncherStore.getMillisForTimestampIfNotPunchedBefore(KeyValueServicePuncherStore.java:169)
	at com.palantir.atlasdb.sweep.metrics.TargetedSweepMetrics.getMillisForTimestampBoundedAtOneWeek(TargetedSweepMetrics.java:75)
	at com.palantir.atlasdb.sweep.metrics.TargetedSweepMetrics.lambda$createWithClock$0(TargetedSweepMetrics.java:68)
	at com.palantir.atlasdb.sweep.metrics.TargetedSweepMetrics$MetricsForStrategy.estimateMillisSinceTs(TargetedSweepMetrics.java:178)
	at com.palantir.atlasdb.sweep.metrics.TargetedSweepMetrics$MetricsForStrategy.lambda$registerMillisSinceLastSweptMetric$0(TargetedSweepMetrics.java:165)
	at com.palantir.util.CachedComposedSupplier.recompute(CachedComposedSupplier.java:68)
	at com.palantir.util.CachedComposedSupplier.get(CachedComposedSupplier.java:59)
        <NEIN - REDACTED>
	at com.palantir.tracing.Tracers.lambda$wrapWithNewTrace$1(Tracers.java:364)
	at com.palantir.tritium.metrics.TaggedMetricsScheduledExecutorService$TaggedMetricsRunnable.run(TaggedMetricsScheduledExecutorService.java:189)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

Normally this gauge is supposed to be deregistered when a TransactionManager (TM) is closed. Interestingly, there is a strong correlation between TMs being closed (TMs are put in an expire-after-access Caffeine cache) and this message being logged. The problem also appears to be somewhat transient, in that the error rate drops to effectively zero when the JVM is bounced.
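For context on the lifecycle: because TMs live in an expire-after-access Caffeine cache, close() happens as a side effect of cache eviction, and that is the point at which metric deregistration is supposed to occur. A minimal sketch of that arrangement (class and method names here are illustrative, not the product's actual code):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
import java.time.Duration;

// Illustrative only: TMs expire after a period of inactivity, and close() -- which is
// expected to also deregister the TM's metrics -- runs from the cache's removal listener.
final class TransactionManagerCacheSketch {
    // Stand-in for the real TransactionManager.
    interface CloseableTm extends AutoCloseable {
        @Override
        void close();
    }

    private final Cache<String, CloseableTm> transactionManagers = Caffeine.newBuilder()
            .expireAfterAccess(Duration.ofMinutes(30)) // illustrative expiry, not the real setting
            .removalListener((String client, CloseableTm tm, RemovalCause cause) -> {
                if (tm != null) {
                    tm.close(); // metric deregistration is supposed to happen in here
                }
            })
            .build();
}
```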

This is non-deterministic: the request rate does not increase every time a TM is deregistered. I think it's likely to be a race condition between deregistering TMs and removing their metrics (or possibly in the way gauge scheduling/reading in the presence of deregistration is handled in the internal logging infrastructure). @gmaretic and I investigated for 20 minutes and didn't find anything that would have manifested in a way consistent with the symptoms we're observing, though.
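To illustrate the kind of interleaving I have in mind (plain Dropwizard metrics, hypothetical names, not AtlasDB's actual code): if a periodic reporter polls a gauge after its backing pool has been closed but before the gauge has been removed from the registry, every poll in that window fails in exactly this way.

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the suspected race: close() shuts the pool immediately, but the
// gauge that reads through the pool is only removed later, so the reporter keeps
// hitting a closed pool in the meantime.
public final class GaugeDeregistrationRaceSketch {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();
        AtomicBoolean poolOpen = new AtomicBoolean(true);

        // Analogous to millisSinceLastSweptTs: reading the gauge requires the "pool" to be open.
        Gauge<Long> gauge = () -> {
            if (!poolOpen.get()) {
                throw new IllegalStateException("Pool not open"); // what commons-pool's assertOpen() does
            }
            return System.currentTimeMillis();
        };
        registry.register("targetedSweep.millisSinceLastSweptTs", gauge);

        // Stand-in for the periodic metric poller.
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(
                () -> registry.getGauges().forEach((name, g) -> {
                    try {
                        g.getValue();
                    } catch (IllegalStateException e) {
                        System.out.println(name + " failed: " + e.getMessage());
                    }
                }),
                0, 100, TimeUnit.MILLISECONDS);

        // "Close the TM": the pool closes now, but the gauge is only removed some time
        // later (or never, if removal is missed) -- every poll in between is noisy.
        poolOpen.set(false);
        Thread.sleep(500);
        registry.remove("targetedSweep.millisSinceLastSweptTs");

        reporter.shutdown();
    }
}
```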

If I've parsed everything correctly, this is not a massive problem, in that we are not actually making tons of consistently failing requests to Cassandra; it does make the logs very noisy, though, which is not ideal.

jeremyk-91 avatar Feb 17 '20 16:02 jeremyk-91

Gauges are only registered/deregistered every 30 seconds in the internal web server, so up to 30 seconds of bad logging makes sense. See the internal class GaugeReporter.
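Roughly the cadence being described (a sketch under the assumption that the reporter re-syncs its view of the registry every 30 seconds and reads values from its last snapshot in between; GaugeReporter's actual internals may differ):

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A gauge removed from the registry just after a sync keeps being read from the stale
// snapshot until the next sync, i.e. for up to 30 seconds.
final class PeriodicGaugeSyncSketch {
    private final MetricRegistry registry;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile Map<String, Gauge> knownGauges = Map.of();

    PeriodicGaugeSyncSketch(MetricRegistry registry) {
        this.registry = registry;
    }

    void start() {
        // Re-sync the set of known gauges every 30 seconds...
        scheduler.scheduleAtFixedRate(
                () -> knownGauges = Map.copyOf(registry.getGauges()), 0, 30, TimeUnit.SECONDS);
        // ...but read values more frequently from whatever set we last synced.
        scheduler.scheduleAtFixedRate(
                () -> knownGauges.forEach((name, gauge) -> gauge.getValue()), 0, 5, TimeUnit.SECONDS);
    }
}
```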

jkozlowski avatar Feb 17 '20 19:02 jkozlowski

The presence of these log lines continues in perpetuity (see the 2020-02-17 11.12.00 screenshot on the ticket). 30 seconds of bad logging would be an acceptable explanation, but I don't think that's what is seen on the ticket.

jeremyk-91 avatar Feb 17 '20 20:02 jeremyk-91

Interesting: the internal skiing product doesn't pass the top-level metric registries; instead it passes brand-new registries (both tagged and Dropwizard) and then registers them with the top-level tagged registry behind a tag. ~~Does atlas have metric registration/deregistration somewhere?~~ Right, MetricsManager has that.
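For reference, the composition pattern being described looks roughly like this (a sketch assuming Tritium's TaggedMetricRegistry addMetrics/removeMetrics delegation methods; the Dropwizard side is omitted and the exact signatures should be checked against the Tritium version in use):

```java
import com.palantir.tritium.metrics.registry.DefaultTaggedMetricRegistry;
import com.palantir.tritium.metrics.registry.TaggedMetricRegistry;

// Sketch: each client/TransactionManager gets a fresh tagged registry, which is mounted
// into the top-level registry behind a distinguishing tag, so tearing it down is a
// single removeMetrics call rather than per-metric deregistration.
final class PerClientRegistriesSketch {
    static TaggedMetricRegistry registerClient(TaggedMetricRegistry topLevel, String clientName) {
        TaggedMetricRegistry clientRegistry = new DefaultTaggedMetricRegistry();
        // Mount the child registry behind a tag on the top-level registry.
        topLevel.addMetrics("client", clientName, clientRegistry);
        return clientRegistry;
    }

    static void deregisterClient(TaggedMetricRegistry topLevel, String clientName) {
        // Unmount everything registered behind that tag in one go.
        topLevel.removeMetrics("client", clientName);
    }
}
```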

jkozlowski avatar Feb 18 '20 09:02 jkozlowski