Transaction mutations partially applied when TiKV nodes crash?
Cluster version: 8.5.1
Client version:
name = "tikv-client"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "048968e4e3d04db472346770cc19914c6b5ae206fa44677f6a0874d54cd05940"
I have a workload where I am ingesting lots of data into TiKV, one transaction after another in quick succession. In one transaction I write/delete many KVs, and in a later transaction (in this specific case ~24h later) I try to read one of the keys I wrote, but no value is found, even though I did not perform any other operation on that key in between. Other keys written in the same transaction as the missing KV are found. Some of the TiKV nodes crash around that time due to the heavy workload. The missing keys are not deterministic, but when we repeat the workload we again see cases where some writes or deletes within a transaction do not seem to be applied. When no nodes crash during the workload we don't seem to have the issue (though that could be coincidence).
- .put(key) within the transaction (default optimistic)
- Begin committing the transaction writing the KV
- 10-20 or so log lines about connecting specifically to one of the nodes (which had likely crashed and was restarting), with these two repeating:
tikv_client::pd::client: connect to tikv endpoint: <node 0>
tikv_client::common::security: connect to rpc server at endpoint: <node 0>
- Client returns that the commit was successful (a minimal sketch of this flow is below)
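To make that concrete, here is a minimal sketch of the pattern, assuming the tikv-client optimistic transaction API and a tokio runtime; the PD endpoint and the key/value contents are placeholders, not taken from our real workload:

use tikv_client::TransactionClient;

#[tokio::main]
async fn main() -> Result<(), tikv_client::Error> {
    // Placeholder PD endpoint.
    let client = TransactionClient::new(vec!["pd:2379"]).await?;

    // Transaction A: write (and delete) many KVs, then commit.
    let mut txn = client.begin_optimistic().await?;
    txn.put("example-key".to_owned(), "example-value".to_owned()).await?;
    // ... many more put()/delete() calls ...
    let commit_ts = txn.commit().await?; // Some(Timestamp) on success
    println!("committed at {:?}", commit_ts);

    // Transaction B (~24h later in the real workload): read the key back.
    // This is where we sometimes see no value even though the commit succeeded.
    let mut read = client.begin_optimistic().await?;
    let value = read.get("example-key".to_owned()).await?;
    read.rollback().await?;
    println!("read back: {:?}", value);
    Ok(())
}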
Not related to this specific transaction, but to give some context on what else was happening: ~30 seconds later, a different transaction (submitted later) fails to commit its secondary keys due to TxnLockNotFound, but it was not the same transaction as the one which wrote the missing KV. ~2 minutes after that we see a heartbeat error with TxnNotFound. Then a transaction errors with gRPC api error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }.
Here are the logs from the cluster around (30s either side) the time I called .commit() on the transaction which was supposed to write this missing KV (I called commit at 2025-07-12T19:28:37.261):
The missing key was 03ee44790100f9465cf9b0426aa7020a0685c66bd10a859b78e2da1dea940ba5e113600083b37e051e0b67eb17348f48d3e6e6e8b009d8ea48854d721488667d6000000000029fd4fc727f8594309b02da47145f019993219e0db644bdfd81ea6890e0c717836bdc370000000000000001 and it was written in transaction with start ts (Timestamp { physical: 1752348517228, logical: 14, suffix_bits: 0 }) and commit ts Timestamp { physical: 1752348517328, logical: 35, suffix_bits: 0 }.
How do I go about debugging this, or what information can I provide to help? I'm not sure if this is a client issue or something else.
What's the cause of the TiKV crash?
Pretty sure it was OOM. To try to debug, we disabled GC so we could look at the MVCC records, but in that case the memory usage was more reasonable, we did not experience any crashes, and we didn't notice any issues with the data. So I guess the GC was a big part of the memory usage. Do we lose some guarantee if we OOM?
On the node which restarted around the time of the transaction (node 0):
[2025/07/12 19:27:53.248 +00:00] [INFO] [lib.rs:91] ["Welcome to TiKV"] [thread_id=1]
...
[2025/07/12 19:27:53.417 +00:00] [WARN] [pipe_builder.rs:475] ["Truncating log file due to broken batch (queue=Rewrite,seq=1,offset=14057): Corruption: Log item offset is smaller than log batch header length"] [thread_id=19]
[2025/07/12 19:28:29.647 +00:00] [WARN] [pipe_builder.rs:475] ["Truncating log file due to broken batch (queue=Append,seq=886,offset=12137510): Corruption: Checksum expected 537067730 but got 2474708232"] [thread_id=21]
[2025/07/12 19:28:34.860 +00:00] [INFO] [engine.rs:93] ["Recovering raft logs takes 41.44691991s"] [thread_id=1]
...
[2025/07/12 19:28:54.596 +00:00] [WARN] [store.rs:1967] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: \"Permission denied\" }"] [thread_id=1]
...
[2025/07/12 19:29:29.643 +00:00] [ERROR] [time.rs:375] ["monotonic time jumped back, 3756488.784 -> 3756402.429"] [thread_id=155]
[2025/07/12 19:29:56.133 +00:00] [ERROR] [time.rs:375] ["monotonic time jumped back, 3756515.273 -> 3756402.429"] [thread_id=154]
The time at which I called commit was 19:28:37.261
The current information is not enough to address the issue. Can you provide the following to help reproduce the issue:
- A minimal reproducible client code sample.
- The deployment of the TiKV cluster, including the configs of TiKV, PD, and TiDB (if there is any).
The OOM of TiKV should be a separate problem. You can submit another issue to the TiKV repo. It would be a great help if you can collect & attach metrics and logs with the issue. PingCAP Clinic can be used to collect them.
@pingyu here is the Kubernetes deployment we're running:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: tidb
spec:
  version: v8.5.1
  timezone: UTC
  pvReclaimPolicy: Delete
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3
  pd:
    baseImage: pingcap/pd
    replicas: 2
    storageClassName: pd-write-optimized
    requests:
      memory: 2Gi
      cpu: "2"
      storage: 100Gi
    limits:
      memory: 2Gi
      # It should never use anywhere near this much, but we want to give it whatever it needs
      cpu: "16"
    affinity:
      podAntiAffinity:
        # Prefer not to schedule on the same node as another cluster component
        preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - tidb
                      - tikv
                  - key: app.kubernetes.io/instance
                    operator: In
                    values:
                      - tidb
              topologyKey: kubernetes.io/hostname
            weight: 100
        # Don't schedule on the same host as another PD instance
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - pd
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                    - tidb
            topologyKey: kubernetes.io/hostname
    config: ""
  tikv:
    baseImage: pingcap/tikv
    replicas: 3
    storageClassName: pd-write-optimized
    requests:
      memory: 16Gi
      cpu: "4"
      storage: 750Gi
    limits:
      memory: 16Gi
      # Effectively unlimited
      cpu: "16"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gomaestro.org/node-role
                  operator: In
                  values:
                    - tikv
      podAntiAffinity:
        # Prefer not to schedule on a node in the same zone as another cluster component
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - tikv
                  - key: app.kubernetes.io/instance
                    operator: In
                    values:
                      - tidb
              topologyKey: topology.kubernetes.io/zone
        # Don't schedule on the same host as another TiKV instance
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - tikv
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                    - tidb
            topologyKey: kubernetes.io/hostname
    tolerations:
      - key: gomaestro.org/node-role
        operator: Equal
        value: tikv
        effect: NoSchedule
    config: |
      [raftstore]
      raft-entry-max-size = "40MB"
      [storage]
      scheduler-pending-write-threshold = "256MB"
      scheduler-worker-pool-size = 8
  # TiDB is only needed to trigger GC in TiKV, so we aren't concerned about HA
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: |
      [log]
      [log.file]
      max-backups = 3
Our workflow is quite involved, so it might be difficult to create a minimal reproducible example; we want to avoid that if possible. We added some code so that after each transaction, we immediately run another read transaction to confirm the mutations were applied, and warn if they were not (sketched at the end of this post). We had an example where the transaction returned a successful commit, but some of the mutations were not applied (35 of 299, or thereabouts). There was an error about the heartbeat and an error about committing the secondary keys:
2025-08-15T03:09:07.446617Z  INFO stage{stage="tikv"}:execute: storage::tikv: flushing batch with 299 actions (Timestamp { physical: 1755227347399, logical: 55, suffix_bits: 0 })
2025-08-15T03:09:07.447201Z  INFO stage{stage="tikv"}:execute: storage::tikv: committing batch
...
2025-08-15T03:09:17.451325Z ERROR stage{stage="tikv"}: tikv_client::transaction::transaction: Error: While sending heartbeat. Multiple key errors: [KeyError(KeyError { locked: None, retryable: "", abort: "", conflict: None, already_exist: None, deadlock: None, commit_ts_expired: None, txn_not_found: Some(TxnNotFound { start_ts: 460122317756563511, primary_key: [3, 238, 68, 122, 0, 0, 43, 130, 216, 24, 88, 33, 131, 88, 28, 54, 160, 54, 198, 110, 203, 208, 222, 8, 115, 110, 21, 158, 163, 145, 0, 169, 103, 170, 192, 187, 3, 44, 107, 217, 255, 41, 78, 160, 0, 26, 166, 44, 251, 90, 96, 0, 0, 0, 0, 0, 63, 23, 159, 145, 43, 194, 155, 221, 95, 143, 8, 213, 125, 124, 80, 224, 44, 249, 22, 47, 133, 141, 71, 53, 11, 245, 116, 17, 164, 164, 228, 61, 103, 204, 198, 0, 0, 0, 0, 0, 0, 0, 2] }), commit_ts_too_large: None, assertion_failed: None })]
...
2025-08-15T03:09:17.912338Z  INFO stage{stage="tikv"}:execute: storage::tikv: finished committing: Some(Timestamp { physical: 1755227357898, logical: 43, suffix_bits: 0 })
2025-08-15T03:09:17.995868Z  WARN stage{stage="tikv"}: tikv_client::transaction::transaction: Failed to commit secondary keys: Multiple errors: [StringError("Multiple key errors: [KeyError(KeyError { locked: None, retryable: \"Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(460122317756563511), commit_ts: TimeStamp(460122320508813355), key: [3, 238, 68, 122, 0, 0, ...] })))))\", abort: \"\", conflict: None, already_exist: None, deadlock: None, commit_ts_expired: None, txn_not_found: None, commit_ts_too_large: None, assertion_failed: None })]")]
2025-08-15T03:09:18.304843Z ERROR stage{stage="tikv"}:execute: storage::tikv: paranoid mode verification failed with 35 errors detected
2025-08-15T03:09:18.304861Z ERROR stage{stage="tikv"}:execute: storage::tikv: 1: Key [03ee447...] should be deleted but has value: [8282d818584...]
...
2025-08-15T03:09:18.304996Z ERROR stage{stage="tikv"}:execute: storage::tikv: 35: Key [03ee447a00004c82d81...] should have value [8282d8185842835...] but doesn't exist
...
What do these heartbeat and secondary key warnings suggest could be happening?
The finished committing: Some(Timestamp { physical: 1755227357898, logical: 43, suffix_bits: 0 }) line is printed when commit() returns success.
We had purposely restricted the resources of the cluster, as the issue seems to only happen when the cluster is stressed (but maybe not necessarily when nodes are going offline/OOMing, as initially thought).
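For reference, the post-commit verification ("paranoid mode") mentioned above is roughly the following sketch. This is not our real code: the write_and_verify name, the batch representation, and the use of tracing for logging are placeholders, and it assumes the tikv-client optimistic transaction API.

use tikv_client::{Error, TransactionClient, Value};

// Hypothetical sketch of the "paranoid mode" check: commit a batch of
// puts/deletes, then immediately re-read every key in a second transaction
// and log any mismatches.
async fn write_and_verify(
    client: &TransactionClient,
    puts: Vec<(Vec<u8>, Vec<u8>)>,
    deletes: Vec<Vec<u8>>,
) -> Result<(), Error> {
    // Apply the whole batch in one optimistic transaction.
    let mut txn = client.begin_optimistic().await?;
    for (k, v) in &puts {
        txn.put(k.clone(), v.clone()).await?;
    }
    for k in &deletes {
        txn.delete(k.clone()).await?;
    }
    let commit_ts = txn.commit().await?;
    tracing::info!("finished committing: {:?}", commit_ts);

    // Immediately re-read everything in a fresh transaction and compare.
    let mut check = client.begin_optimistic().await?;
    let mut errors = 0usize;
    for (k, expected) in &puts {
        let got: Option<Value> = check.get(k.clone()).await?;
        if got.as_deref() != Some(expected.as_slice()) {
            errors += 1;
            tracing::error!("key {:?} should have value {:?} but has {:?}", k, expected, got);
        }
    }
    for k in &deletes {
        if check.get(k.clone()).await?.is_some() {
            errors += 1;
            tracing::error!("key {:?} should be deleted but still has a value", k);
        }
    }
    check.rollback().await?; // read-only, nothing to commit
    if errors > 0 {
        tracing::error!("paranoid mode verification failed with {} errors detected", errors);
    }
    Ok(())
}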