Transaction mutations partially applied when TiKV nodes crash?
Cluster version: 8.5.1
Client version:
name = "tikv-client"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "048968e4e3d04db472346770cc19914c6b5ae206fa44677f6a0874d54cd05940"
I have a workload where I am ingesting lots of data into TiKV, one transaction after another in quick succession. In one transaction I write/delete many KVs, and in a later transaction (in this specific case ~24h later) I try to read one of the keys I wrote, but no value is found, even though I did not perform any other operation on that key in between. Other keys written in the same transaction as the missing KV are found. Some of the TiKV nodes crash around that time due to the heavy workload. The missing keys are not deterministic, but when we repeat the workload we again see cases where some writes or deletes within a transaction do not seem to be applied. When no nodes crash during the workload we don't seem to have the issue (though that could be coincidence).
- .put(key) within the transaction (default optimistic)
- Begin committing the transaction writing the KV
- 10-20 or so log lines about connecting specifically to one of the nodes (which had likely crashed and was restarting), with these two repeating:
tikv_client::pd::client: connect to tikv endpoint: <node 0>
tikv_client::common::security: connect to rpc server at endpoint: <node 0>
- Client returns that the commit was successful (a minimal sketch of this flow is below)
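To make that concrete, here is a minimal sketch of the pattern, assuming the tikv-client optimistic transaction API and a tokio runtime; the PD endpoint and the key/value contents are placeholders, not taken from our real workload:

use tikv_client::TransactionClient;

#[tokio::main]
async fn main() -> Result<(), tikv_client::Error> {
    // Placeholder PD endpoint.
    let client = TransactionClient::new(vec!["pd:2379"]).await?;

    // Transaction A: write (and delete) many KVs, then commit.
    let mut txn = client.begin_optimistic().await?;
    txn.put("example-key".to_owned(), "example-value".to_owned()).await?;
    // ... many more put()/delete() calls ...
    let commit_ts = txn.commit().await?; // Some(Timestamp) on success
    println!("committed at {:?}", commit_ts);

    // Transaction B (~24h later in the real workload): read the key back.
    // This is where we sometimes see no value even though the commit succeeded.
    let mut read = client.begin_optimistic().await?;
    let value = read.get("example-key".to_owned()).await?;
    read.rollback().await?;
    println!("read back: {:?}", value);
    Ok(())
}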
Not related to this specific transaction, but to give some context on what else was happening: ~30 seconds later, a different transaction (submitted later) fails to commit its secondary keys due to TxnLockNotFound, but it was not the same transaction as the one which wrote the missing KV. ~2 minutes after that we see a heartbeat error with TxnNotFound. Then a transaction errors with gRPC api error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }.
Here are the logs from the cluster around (30s either side) the time I called .commit() on the transaction which was supposed to write this missing KV (I called commit at 2025-07-12T19:28:37.261):
The missing key was 03ee44790100f9465cf9b0426aa7020a0685c66bd10a859b78e2da1dea940ba5e113600083b37e051e0b67eb17348f48d3e6e6e8b009d8ea48854d721488667d6000000000029fd4fc727f8594309b02da47145f019993219e0db644bdfd81ea6890e0c717836bdc370000000000000001 and it was written in transaction with start ts (Timestamp { physical: 1752348517228, logical: 14, suffix_bits: 0 }) and commit ts Timestamp { physical: 1752348517328, logical: 35, suffix_bits: 0 }.
How do I go about debugging this, or what information can I provide to help? I'm not sure if this is a client issue or something else.
What's the cause of the TiKV crash?
Pretty sure it was OOM. To try to debug, we disabled GC so we could look at the MVCC records, but in that case the memory usage was more reasonable, we did not experience any crashes, and we didn't notice any issues with the data. So I guess the GC was a big part of the memory usage. Do we lose some guarantee if we OOM?
On the node which restarted around the time of the transaction (node 0):
[2025/07/12 19:27:53.248 +00:00] [INFO] [lib.rs:91] ["Welcome to TiKV"] [thread_id=1]
...
[2025/07/12 19:27:53.417 +00:00] [WARN] [pipe_builder.rs:475] ["Truncating log file due to broken batch (queue=Rewrite,seq=1,offset=14057): Corruption: Log item offset is smaller than log batch header length"] [thread_id=19]
[2025/07/12 19:28:29.647 +00:00] [WARN] [pipe_builder.rs:475] ["Truncating log file due to broken batch (queue=Append,seq=886,offset=12137510): Corruption: Checksum expected 537067730 but got 2474708232"] [thread_id=21]
[2025/07/12 19:28:34.860 +00:00] [INFO] [engine.rs:93] ["Recovering raft logs takes 41.44691991s"] [thread_id=1]
...
[2025/07/12 19:28:54.596 +00:00] [WARN] [store.rs:1967] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: \"Permission denied\" }"] [thread_id=1]
...
[2025/07/12 19:29:29.643 +00:00] [ERROR] [time.rs:375] ["monotonic time jumped back, 3756488.784 -> 3756402.429"] [thread_id=155]
[2025/07/12 19:29:56.133 +00:00] [ERROR] [time.rs:375] ["monotonic time jumped back, 3756515.273 -> 3756402.429"] [thread_id=154]
The time at which I called commit was 19:28:37.261
The current information is not enough to address the issue. Can you provide the following to help reproduce the issue:
- A minimal reproducible client code sample.
- The deployment of the TiKV cluster, including the configs of TiKV, PD, and TiDB (if there is any).
The OOM of TiKV should be a separate problem. You can submit another issue to the TiKV repo. It would be a great help if you can collect & attach metrics and logs with the issue. PingCAP Clinic can be used to collect them.
@pingyu here is the Kubernetes deployment we're running:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: tidb
spec:
  version: v8.5.1
  timezone: UTC
  pvReclaimPolicy: Delete
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3
  pd:
    baseImage: pingcap/pd
    replicas: 2
    storageClassName: pd-write-optimized
    requests:
      memory: 2Gi
      cpu: "2"
      storage: 100Gi
    limits:
      memory: 2Gi
      # It should never use anywhere near this much, but we want to give it whatever it needs
      cpu: "16"
    affinity:
      podAntiAffinity:
        # Prefer not to schedule on the same node as another cluster component
        preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - tidb
                      - tikv
                  - key: app.kubernetes.io/instance
                    operator: In
                    values:
                      - tidb
              topologyKey: kubernetes.io/hostname
            weight: 100
        # Don't schedule on the same host as another PD instance
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - pd
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                    - tidb
            topologyKey: kubernetes.io/hostname
    config: ""
  tikv:
    baseImage: pingcap/tikv
    replicas: 3
    storageClassName: pd-write-optimized
    requests:
      memory: 16Gi
      cpu: "4"
      storage: 750Gi
    limits:
      memory: 16Gi
      # Effectively unlimited
      cpu: "16"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gomaestro.org/node-role
                  operator: In
                  values:
                    - tikv
      podAntiAffinity:
        # Prefer not to schedule on a node in the same zone as another cluster component
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - tikv
                  - key: app.kubernetes.io/instance
                    operator: In
                    values:
                      - tidb
              topologyKey: topology.kubernetes.io/zone
        # Don't schedule on the same host as another TiKV instance
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - tikv
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                    - tidb
            topologyKey: kubernetes.io/hostname
    tolerations:
      - key: gomaestro.org/node-role
        operator: Equal
        value: tikv
        effect: NoSchedule
    config: |
      [raftstore]
      raft-entry-max-size = "40MB"
      [storage]
      scheduler-pending-write-threshold = "256MB"
      scheduler-worker-pool-size = 8
  # TiDB is only needed to trigger GC in TiKV, so we aren't concerned about HA
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: |
      [log]
      [log.file]
      max-backups = 3
Our workflow is quite involved, so it might be difficult to create a minimal reproducible example; we want to avoid that if possible. We added some code so that after each transaction, we immediately run another read transaction to confirm the mutations were applied, and warn if they were not (sketched at the end of this post). We had an example where the transaction returned a successful commit, but some of the mutations were not applied (35 of 299, or thereabouts). There was an error about the heartbeat and an error about committing the secondary keys:
2025-08-15T03:09:07.446617Z  INFO stage{stage="tikv"}:execute: storage::tikv: flushing batch with 299 actions (Timestamp { physical: 1755227347399, logical: 55, suffix_bits: 0 })
2025-08-15T03:09:07.447201Z  INFO stage{stage="tikv"}:execute: storage::tikv: committing batch
...
2025-08-15T03:09:17.451325Z ERROR stage{stage="tikv"}: tikv_client::transaction::transaction: Error: While sending heartbeat. Multiple key errors: [KeyError(KeyError { locked: None, retryable: "", abort: "", conflict: None, already_exist: None, deadlock: None, commit_ts_expired: None, txn_not_found: Some(TxnNotFound { start_ts: 460122317756563511, primary_key: [3, 238, 68, 122, 0, 0, 43, 130, 216, 24, 88, 33, 131, 88, 28, 54, 160, 54, 198, 110, 203, 208, 222, 8, 115, 110, 21, 158, 163, 145, 0, 169, 103, 170, 192, 187, 3, 44, 107, 217, 255, 41, 78, 160, 0, 26, 166, 44, 251, 90, 96, 0, 0, 0, 0, 0, 63, 23, 159, 145, 43, 194, 155, 221, 95, 143, 8, 213, 125, 124, 80, 224, 44, 249, 22, 47, 133, 141, 71, 53, 11, 245, 116, 17, 164, 164, 228, 61, 103, 204, 198, 0, 0, 0, 0, 0, 0, 0, 2] }), commit_ts_too_large: None, assertion_failed: None })]
...
2025-08-15T03:09:17.912338Z  INFO stage{stage="tikv"}:execute: storage::tikv: finished committing: Some(Timestamp { physical: 1755227357898, logical: 43, suffix_bits: 0 })
2025-08-15T03:09:17.995868Z  WARN stage{stage="tikv"}: tikv_client::transaction::transaction: Failed to commit secondary keys: Multiple errors: [StringError("Multiple key errors: [KeyError(KeyError { locked: None, retryable: \"Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(460122317756563511), commit_ts: TimeStamp(460122320508813355), key: [3, 238, 68, 122, 0, 0, ...] })))))\", abort: \"\", conflict: None, already_exist: None, deadlock: None, commit_ts_expired: None, txn_not_found: None, commit_ts_too_large: None, assertion_failed: None })]")]
2025-08-15T03:09:18.304843Z ERROR stage{stage="tikv"}:execute: storage::tikv: paranoid mode verification failed with 35 errors detected
2025-08-15T03:09:18.304861Z ERROR stage{stage="tikv"}:execute: storage::tikv: 1: Key [03ee447...] should be deleted but has value: [8282d818584...]
...
2025-08-15T03:09:18.304996Z ERROR stage{stage="tikv"}:execute: storage::tikv: 35: Key [03ee447a00004c82d81...] should have value [8282d8185842835...] but doesn't exist
...
What do these heartbeat and secondary key warnings suggest could be happening?
The finished committing: Some(Timestamp { physical: 1755227357898, logical: 43, suffix_bits: 0 }) line is printed when commit() returns success.
We had purposely restricted the resources of the cluster, as the issue seems to only happen when the cluster is stressed (but maybe not necessarily when nodes are going offline/OOMing, as initially thought).
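For reference, the post-commit verification ("paranoid mode") mentioned above is roughly the following sketch. This is not our real code: the write_and_verify name, the batch representation, and the use of tracing for logging are placeholders, and it assumes the tikv-client optimistic transaction API.

use tikv_client::{Error, TransactionClient, Value};

// Hypothetical sketch of the "paranoid mode" check: commit a batch of
// puts/deletes, then immediately re-read every key in a second transaction
// and log any mismatches.
async fn write_and_verify(
    client: &TransactionClient,
    puts: Vec<(Vec<u8>, Vec<u8>)>,
    deletes: Vec<Vec<u8>>,
) -> Result<(), Error> {
    // Apply the whole batch in one optimistic transaction.
    let mut txn = client.begin_optimistic().await?;
    for (k, v) in &puts {
        txn.put(k.clone(), v.clone()).await?;
    }
    for k in &deletes {
        txn.delete(k.clone()).await?;
    }
    let commit_ts = txn.commit().await?;
    tracing::info!("finished committing: {:?}", commit_ts);

    // Immediately re-read everything in a fresh transaction and compare.
    let mut check = client.begin_optimistic().await?;
    let mut errors = 0usize;
    for (k, expected) in &puts {
        let got: Option<Value> = check.get(k.clone()).await?;
        if got.as_deref() != Some(expected.as_slice()) {
            errors += 1;
            tracing::error!("key {:?} should have value {:?} but has {:?}", k, expected, got);
        }
    }
    for k in &deletes {
        if check.get(k.clone()).await?.is_some() {
            errors += 1;
            tracing::error!("key {:?} should be deleted but still has a value", k);
        }
    }
    check.rollback().await?; // read-only, nothing to commit
    if errors > 0 {
        tracing::error!("paranoid mode verification failed with {} errors detected", errors);
    }
    Ok(())
}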