Dolt SQL Server Replica Pod Fails to Start After Restart with Nil Pointer Dereference Panic
I created two pods using the image dolthub/dolt-sql-server:1.76.2, where dolt-0 is the primary node and dolt-1 is the replica node.
After running for some time, the dolt-1 pod restarted and then repeatedly failed to come back up. The failure logs are as follows:
Defaulted container "insight-dolt" out of: insight-dolt, init-config (init)
Starting server with Config HP="0.0.0.0:6033"|T="28800000"|R="false"|L="info"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x242dace]
goroutine 122 [running]:
github.com/dolthub/dolt/go/libraries/doltcore/doltdb.(*DoltDB).AccessMode(0x440cbd0?)
github.com/dolthub/dolt/go/libraries/doltcore/doltdb/doltdb.go:192 +0xe
github.com/dolthub/dolt/go/libraries/doltcore/env.(*DoltEnv).IsAccessModeReadOnly(0xc001076be0?, {0x440cbd0?, 0xc00107e690?})
github.com/dolthub/dolt/go/libraries/doltcore/env/environment.go:1321 +0x56
github.com/dolthub/dolt/go/cmd/dolt/commands/sqlserver.ConfigureServices.func5.1(...)
github.com/dolthub/dolt/go/cmd/dolt/commands/sqlserver/server.go:194
github.com/dolthub/dolt/go/libraries/doltcore/env.(*MultiRepoEnv).Iter(...)
github.com/dolthub/dolt/go/libraries/doltcore/env/multi_repo_env.go:299
github.com/dolthub/dolt/go/cmd/dolt/commands/sqlserver.ConfigureServices.func5({0x440cbd0, 0xc00107e690})
github.com/dolthub/dolt/go/cmd/dolt/commands/sqlserver/server.go:193 +0x7d
github.com/dolthub/dolt/go/libraries/utils/svcs.AnonService.Init(...)
github.com/dolthub/dolt/go/libraries/utils/svcs/controller.go:48
github.com/dolthub/dolt/go/libraries/utils/svcs.(*Controller).Start(0xc000eaecc0, {0x440cbd0, 0xc00107e690})
github.com/dolthub/dolt/go/libraries/utils/svcs/controller.go:221 +0x1bf
created by github.com/dolthub/dolt/go/cmd/dolt/commands/sqlserver.Serve in goroutine 1
github.com/dolthub/dolt/go/cmd/dolt/commands/sqlserver/server.go:102 +0x145
Unable to determine the root cause, I cleaned up the PVC used by dolt-1 and recreated the pod, which resolved the issue. I would now like to understand what caused the problem and what the proper fix is.
My Dolt instance has 4 databases:
root@dolt-0:/db/dolt# du -h -d 1
137M ./insight_metabase
1.1M ./insight_system
21M ./insight_jobadmin
3.2G ./insight_datawarehouse
5.5G ./.dolt
8.7G .
When I cleaned up the PVC of the replica node dolt-1 and remounted it, health checks failed during the data synchronization process from dolt-0 to dolt-1. The StatefulSet's health check configuration is as follows:
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 6033
  timeoutSeconds: 1
name: insight-dolt
ports:
- containerPort: 6033
  name: dolt-port
  protocol: TCP
- containerPort: 50051
  name: dolt-cluster
  protocol: TCP
readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 20
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 6033
  timeoutSeconds: 1
resources:
  limits:
    cpu: "2"
    memory: 8Gi
  requests:
    cpu: 50m
    memory: 50Mi
After I cleaned up the PVC of dolt-1 again and restarted dolt-1, databases insight_jobadmin, insight_metabase, and insight_system synchronized normally, but insight_datawarehouse showed anomalies. The dolt_cluster_status results are as follows:
dolt-0
insight_datawarehouse/main*> select * from dolt_cluster.dolt_cluster_status;
+-----------------------+----------------+---------+-------+------------------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| database | standby_remote | role | epoch | replication_lag_millis | last_update | current_error |
+-----------------------+----------------+---------+-------+------------------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| insight_datawarehouse | standby | primary | 1 | NULL | NULL | failed to commit chunks on destDB: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp: lookup insight-dolt-1.insight-dolt-headless.sit.svc.cluster.local on 10.96.0.10:53: no such host" |
| insight_jobadmin | standby | primary | 1 | 0 | 2025-11-11 09:29:13 | NULL |
| insight_metabase | standby | primary | 1 | 659 | 2025-11-11 09:29:39 | NULL |
| insight_system | standby | primary | 1 | 0 | 2025-11-11 08:58:50 | NULL |
+-----------------------+----------------+---------+-------+------------------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
dolt-1
insight_datawarehouse/main> select * from dolt_cluster.dolt_cluster_status;
+-----------------------+----------------+---------+-------+------------------------+---------------------+---------------+
| database | standby_remote | role | epoch | replication_lag_millis | last_update | current_error |
+-----------------------+----------------+---------+-------+------------------------+---------------------+---------------+
| insight_jobadmin | standby | standby | 1 | NULL | 2025-11-11 09:29:23 | NULL |
| insight_metabase | standby | standby | 1 | NULL | 2025-11-11 09:29:24 | NULL |
| insight_datawarehouse | standby | standby | 1 | NULL | NULL | NULL |
| insight_system | standby | standby | 1 | NULL | 2025-11-11 09:29:23 | NULL |
+-----------------------+----------------+---------+-------+------------------------+---------------------+---------------+
Could the database size be affecting the primary-replica synchronization?
It's very strange. After I took dolt-1 offline, the primary status is as follows:
insight_datawarehouse/main*> select * from dolt_cluster.dolt_cluster_status \G;
*************************** 1. row ***************************
database: insight_datawarehouse
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: NULL
last_update: NULL
current_error: failed to commit chunks on destDB: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp: lookup insight-dolt-1.insight-dolt-headless.sit.svc.cluster.local on 10.96.0.10:53: no such host"
*************************** 2. row ***************************
database: insight_jobadmin
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 46458
last_update: 2025-11-11 09:47:14
current_error: failed to commit chunks on destDB: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp: lookup insight-dolt-1.insight-dolt-headless.sit.svc.cluster.local on 10.96.0.10:53: no such host"
*************************** 3. row ***************************
database: insight_metabase
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 27876
last_update: 2025-11-11 09:47:33
current_error: failed to commit chunks on destDB: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp: lookup insight-dolt-1.insight-dolt-headless.sit.svc.cluster.local on 10.96.0.10:53: no such host"
*************************** 4. row ***************************
database: insight_system
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 0
last_update: 2025-11-11 08:58:50
current_error: NULL
4 rows in set (0.00 sec)
After dolt-1 came back online, the results are as follows:
insight_datawarehouse/main*> select * from dolt_cluster.dolt_cluster_status \G;
*************************** 1. row ***************************
database: insight_datawarehouse
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: NULL
last_update: NULL
current_error: failed to commit chunks on destDB: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp: lookup insight-dolt-1.insight-dolt-headless.sit.svc.cluster.local on 10.96.0.10:53: no such host"
*************************** 2. row ***************************
database: insight_jobadmin
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 0
last_update: 2025-11-11 09:50:45
current_error: NULL
*************************** 3. row ***************************
database: insight_metabase
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 995
last_update: 2025-11-11 09:51:04
current_error: NULL
*************************** 4. row ***************************
database: insight_system
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 0
last_update: 2025-11-11 08:58:50
current_error: NULL
4 rows in set (0.00 sec)
It appears that insight_datawarehouse cannot synchronize from dolt-0 to dolt-1. Why is this happening?
After a few minutes, the status of insight_datawarehouse returned to normal.
insight_datawarehouse/main*> select * from dolt_cluster.dolt_cluster_status \G;
*************************** 1. row ***************************
database: insight_datawarehouse
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 0
last_update: 2025-11-11 09:50:39
current_error: NULL
*************************** 2. row ***************************
database: insight_jobadmin
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 0
last_update: 2025-11-11 09:56:16
current_error: NULL
*************************** 3. row ***************************
database: insight_metabase
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 3139
last_update: 2025-11-11 09:56:18
current_error: NULL
*************************** 4. row ***************************
database: insight_system
standby_remote: standby
role: primary
epoch: 1
replication_lag_millis: 0
last_update: 2025-11-11 08:58:50
current_error: NULL
4 rows in set (0.00 sec)
Hi @lihh1992,
Thanks for the detailed repro and probe settings. We’ve tracked the crash to a corrupt manifest on the standby: once dolt sql-server tries to open that repo it gets a nil DoltDB, and later the access-mode check dereferences that nil, producing the panic you saw:
panic: runtime error: invalid memory address or nil pointer dereference
github.com/dolthub/dolt/go/libraries/doltcore/doltdb.(*DoltDB).AccessMode(0x440cbd0?)
We’re making Dolt handle this more gracefully. If a manifest won’t parse, the server will stay up, mark the database offline, and emit an actionable error instead of crashing. That keeps the primary and other replicas available while you reclone or restore the broken repo.
On the operational side:
livenessProbe:
  …
  timeoutSeconds: 1
readinessProbe:
  …
  timeoutSeconds: 1
The Kubernetes docs describe a PVC as a request for persistent storage and probes as the mechanism the kubelet uses to decide whether a container is ready or needs a restart (PVC, Probe Behavior). With a 1-second timeout and failureThreshold: 3, three slow TCP responses in a row are enough for the kubelet to kill the replica, and that is exactly what happens while the SQL server is busy downloading or committing large batches of chunks. An empty PVC has to rebuild a 3.2 GB database, and while the replica is restarting or marked not ready its DNS record typically disappears from the headless service, which is why the primary reports transient lookup errors:
failed to commit chunks on destDB: … dial tcp: lookup insight-dolt-1… no such host
Every restart discards whatever replication work had been done so far (downloaded chunks, staged commits, updated manifests), forcing the next run to start over. Once the replica finally survives long enough without a restart, it finishes the sync and the errors clear.
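If it helps as a starting point, here is one way the probes could be relaxed while keeping the TCP checks on port 6033; the numbers below are illustrative, not tuned recommendations for your workload. A longer timeoutSeconds stops a momentarily busy server from being counted as dead, and a higher failureThreshold on the liveness probe means the kubelet only restarts the container after a sustained outage rather than a short stall during chunk transfer:
livenessProbe:
  failureThreshold: 12     # ~2 minutes of consecutive failures before a restart
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 6033
  timeoutSeconds: 5        # was 1; allow slow responses while chunks are transferring
readinessProbe:
  failureThreshold: 6
  initialDelaySeconds: 20
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 6033
  timeoutSeconds: 5        # was 1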
I suspect the rapid restarts are also what corrupted the manifest in the first place: when the kubelet kills the process mid-write, the manifest file can be left truncated or containing garbage even if the underlying data is fine. A longer-term change could be to harden Dolt's storage layer, e.g. by writing manifests atomically (write to a temp file, then rename it into place).
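To make the idea concrete, here is a minimal, generic Go sketch of the temp-file-plus-rename pattern; it is not Dolt's actual manifest code, and the helper name and paths are invented for illustration:
package main

import (
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file in the same directory,
// syncs it to disk, then renames it over the target path. On POSIX
// filesystems the rename is atomic, so readers see either the old or the
// new contents, never a truncated file.
func writeFileAtomic(path string, data []byte) error {
	dir := filepath.Dir(path)
	tmp, err := os.CreateTemp(dir, ".manifest-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless if the rename already succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	// Example: atomically replace a manifest-like file (path is illustrative).
	_ = writeFileAtomic("/tmp/example-manifest", []byte("contents"))
}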
One immediate fix could be to seed the standby (run dolt clone onto the PVC before the pod starts) or to relax the probes so the initial sync can finish; at the moment the probes give the replica only a one-second window to respond.
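As a sketch of the seeding option, and only under the assumption that the primary's cluster remotesapi on port 50051 will serve clone requests from inside the cluster without credentials and that the image ships a shell, an init container could populate the PVC before the SQL server starts. The /db/dolt path and database name come from your setup; the volume name and everything else here is hypothetical:
initContainers:
- name: seed-standby
  image: dolthub/dolt-sql-server:1.76.2
  workingDir: /db/dolt
  command:
  - /bin/sh
  - -c
  # Clone the largest database only if it is not already on the PVC.
  - |
    if [ ! -d insight_datawarehouse ]; then
      dolt clone http://insight-dolt-0.insight-dolt-headless.sit.svc.cluster.local:50051/insight_datawarehouse
    fi
  volumeMounts:
  - name: data              # hypothetical PVC volume name
    mountPath: /db/dolt
With the clone already on disk, the standby only has to catch up on new commits instead of transferring the full 3.2 GB while the probes are running.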
@lihh1992 I've discussed this further with @reltuk, and to get a concrete repro I'd like to request logs from your server process. If possible, a snapshot would also help us make progress. In the meantime, I'll still follow through on adding an explicit error, although that interim fix will still prevent the server from starting up.