Batch GRV Rate Limit Exceeded is not always thrown

Open ScottDugas opened this issue 1 year ago • 2 comments

The tests noted in https://github.com/FoundationDB/fdb-record-layer/issues/2813 occasionally run forever because of this code: https://github.com/FoundationDB/fdb-record-layer/blob/200ac05041a1af712f621a27b4c5c37f9eab001c/fdb-record-layer-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/storestate/FDBRecordStoreStateCacheEntry.java#L97-L100

That code combines two futures. The first, recordStore.loadRecordStoreStateAsync, does a regular read. The second does a snapshot get of SystemKeyspace.METADATA_VERSION_KEY.

The first future fails with Batch GRV request rate limit exceeded (code 1051). The second future never completes.
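
To illustrate why this shows up as a hang rather than an error, here is a minimal sketch using plain java.util.concurrent.CompletableFuture, with no FoundationDB involved. It assumes the two futures are joined with something like thenCombine, which waits for both inputs before completing, even if one of them has already failed. The class name, method names, and the stand-in futures are hypothetical and are not taken from the Record Layer code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; the futures only stand in for the two reads described above.
class CombineHangDemo {

    // Waits up to `millis` and reports how the future ended.
    static String outcome(CompletableFuture<?> future, long millis) {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (ExecutionException e) {
            return "failed";
        } catch (TimeoutException e) {
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    // Stand-in for the regular read: already failed with the rate-limit error.
    static CompletableFuture<String> failedStateRead() {
        CompletableFuture<String> f = new CompletableFuture<>();
        f.completeExceptionally(
                new RuntimeException("Batch GRV request rate limit exceeded (1051)"));
        return f;
    }

    // thenCombine does not finish until BOTH inputs are done, so the error
    // from the first read stays hidden while the second read is pending.
    static String combineOutcome(long millis) {
        CompletableFuture<String> stateRead = failedStateRead();
        CompletableFuture<byte[]> versionRead = new CompletableFuture<>(); // never completes
        CompletableFuture<String> combined = stateRead.thenCombine(versionRead, (s, v) -> s);
        return outcome(combined, millis);
    }

    // One possible caller-side mitigation (an assumption, not a fix for the
    // underlying hang): forward a failure from either input right away.
    static String failFastOutcome(long millis) {
        CompletableFuture<String> stateRead = failedStateRead();
        CompletableFuture<byte[]> versionRead = new CompletableFuture<>(); // never completes
        CompletableFuture<String> combined = stateRead.thenCombine(versionRead, (s, v) -> s);
        stateRead.whenComplete((r, e) -> { if (e != null) combined.completeExceptionally(e); });
        versionRead.whenComplete((r, e) -> { if (e != null) combined.completeExceptionally(e); });
        return outcome(combined, millis);
    }

    public static void main(String[] args) {
        System.out.println("thenCombine: " + combineOutcome(200));  // timed out
        System.out.println("fail-fast:   " + failFastOutcome(200)); // failed
    }
}
```

The fail-fast variant only makes the 1051 error visible to the caller. The second future still never completes, which is the actual problem reported here.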

I have tried to reproduce this in a more isolated environment, but it is proving tricky to get it to reliably start failing with Batch GRV request rate limit exceeded.

ScottDugas, Jul 11 '24 15:07

OK, I created a reasonable reproduction at: https://github.com/FoundationDB/fdb-record-layer/pull/2823/files

About half the time, it fails with timeouts only for the reads of SystemKeyspace.METADATA_VERSION_KEY, and with Batch GRV request rate limit exceeded for the other operations. The rest of the time, it fails with timeouts for all the operations.

ScottDugas, Jul 15 '24 14:07

I think we have stumbled across this issue in simulation. We have a very basic fork of the Record Layer in Rust that we can simulate as an external workload. This morning we found a specific seed (5267156628) that fails the same way: transaction.get_metadata_version hangs.

FoundationDB 7.3 (v7.3.43)
source version 412531b5c97fa84343da94888cc949a4d29e8c29
protocol fdb00b073000000

Our testfile looks like this:

[[test]]
testTitle = 'QuotaWorkload'

[[test.workload]]
testName = 'External'
libraryName = 'ldb'
workloadName = 'QuotaWorkload'
libraryPath = './target/release'
iteration_count = 50

[[test.workload]]
testName = 'RandomClogging'
testDuration = 30.0
swizzle = 1

[[test.workload]]
testName = 'Attrition'
machinesToKill = 10
machinesToLeave = 3
reboot = true
testDuration = 30.0

[[test.workload]]
testName = 'Rollback'
testDuration = 30

[[test.workload]]
testName = 'ChangeConfig'
maxDelayBeforeChange = 30.0
coordinators = 'auto'

Let me know if we can help :smile:

PierreZ, Aug 09 '24 08:08