SQLite failed due to missing DB table
Yesterday I launched the single-process dev server: temporal server start-dev. Usually I keep that running for a couple of days without any issues.
But today, I got this HTTP 503 response on the web UI:
{"statusCode":503,"statusText":"Service Unavailable","response":{},"message":"GetClusterMetadata operation failed. Error: SQL logic error: no such table: cluster_metadata_info (1)"}
So I had to restart the process. I'm still trying to figure out how to reproduce this, or whether it is a real issue, so I'm leaving this here as a record in case it repeats or we can better understand the problem.
Some log snippets:
{"level":"error","ts":"2023-01-06T10:01:33.589-0800","msg":"Operation failed with internal error.","error":"GetMetadata operation failed. Error: SQL logic error: no such table: namespace_metadata (1)","metric-scope":55,"logging-call-at":"persistenceMetricClients.go:1579","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\tgo.temporal.io/[email protected]/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1579\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).GetMetadata\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:908\ngo.temporal.io/server/common/persistence.(*metadataRetryablePersistenceClient).GetMetadata.func1\n\tgo.temporal.io/[email protected]/common/persistence/persistenceRetryableClients.go:901\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/[email protected]/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*metadataRetryablePersistenceClient).GetMetadata\n\tgo.temporal.io/[email protected]/common/persistence/persistenceRetryableClients.go:905\ngo.temporal.io/server/common/namespace.(*registry).refreshNamespaces\n\tgo.temporal.io/[email protected]/common/namespace/registry.go:426\ngo.temporal.io/server/common/namespace.(*registry).refreshLoop\n\tgo.temporal.io/[email protected]/common/namespace/registry.go:403\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\tgo.temporal.io/[email protected]/internal/goro/goro.go:64"}
{"level":"error","ts":"2023-01-06T10:01:33.589-0800","msg":"Operation failed with internal error.","error":"GetTaskQueue operation failed. Failed to check if task queue default-worker-tq of type Workflow existed. Error: SQL logic error: no such table: task_queues (1)","metric-scope":41,"logging-call-at":"persistenceMetricClients.go:1579","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\tgo.temporal.io/[email protected]/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1579\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).GetTaskQueue\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:724\ngo.temporal.io/server/service/matching.(*taskQueueDB).takeOverTaskQueueLocked\n\tgo.temporal.io/[email protected]/service/matching/db.go:123\ngo.temporal.io/server/service/matching.(*taskQueueDB).RenewLease\n\tgo.temporal.io/[email protected]/service/matching/db.go:109\ngo.temporal.io/server/service/matching.(*taskWriter).renewLeaseWithRetry.func1\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:302\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/[email protected]/common/backoff/retry.go:194\ngo.temporal.io/server/service/matching.(*taskWriter).renewLeaseWithRetry\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:306\ngo.temporal.io/server/service/matching.(*taskWriter).initReadWriteState\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:131\ngo.temporal.io/server/service/matching.(*taskWriter).taskWriterLoop\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:221\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\tgo.temporal.io/[email protected]/internal/goro/goro.go:64"}
Expected Behavior
The dev server keeps running without persistence errors.
Actual Behavior
The single-process dev server failed due to a missing DB table.
Steps to Reproduce the Problem
Unknown. I was not able to construct reproducible steps.
What I did initially:
% temporal server start-dev
- do something for a day
- next day, it fails
Specifications
- Version:
% temporal -v
temporal version 0.2.0 (server 1.18.5) (ui 2.9.0)
- Platform:
% uname -mrs
Darwin 21.6.0 arm64
The same issue has been reported for temporal version 0.5.0 (server 1.20.0) (ui 2.10.3):
{"level":"error","ts":"2023-02-22T08:56:41.887-0800","msg":"Operation failed with internal error.","error":"ListNamespaces operation failed. Failed to get namespace rows. Error: SQL logic error: no such table: namespaces (1)","operation":"ListNamespaces","logging-call-at":"persistenceMetricClients.go:1171","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\tgo.temporal.io/[email protected]/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1171\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1148\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).ListNamespaces.func1\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:683\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).ListNamespaces\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:685\ngo.temporal.io/server/common/persistence.(*metadataRetryablePersistenceClient).ListNamespaces.func1\n\tgo.temporal.io/[email protected]/common/persistence/persistenceRetryableClients.go:887\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/[email protected]/common/backoff/retry.go:199\ngo.temporal.io/server/common/persistence.(*metadataRetryablePersistenceClient).ListNamespaces\n\tgo.temporal.io/[email protected]/common/persistence/persistenceRetryableClients.go:891\ngo.temporal.io/server/common/namespace.(*registry).refreshNamespaces\n\tgo.temporal.io/[email protected]/common/namespace/registry.go:386\ngo.temporal.io/server/common/namespace.(*registry).refreshLoop\n\tgo.temporal.io/[email protected]/common/namespace/registry.go:357\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\tgo.temporal.io/[email protected]/internal/goro/goro.go:64"}
{"level":"error","ts":"2023-02-22T08:56:41.892-0800","msg":"Operation failed with internal error.","error":"GetTaskQueue operation failed. Failed to check if task queue default-worker-tq of type Workflow existed. Error: SQL logic error: no such table: task_queues (1)","operation":"GetTaskQueue","logging-call-at":"persistenceMetricClients.go:1171","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\tgo.temporal.io/[email protected]/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1171\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1148\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).GetTaskQueue.func1\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:567\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).GetTaskQueue\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:569\ngo.temporal.io/server/service/matching.(*taskQueueDB).takeOverTaskQueueLocked\n\tgo.temporal.io/[email protected]/service/matching/db.go:123\ngo.temporal.io/server/service/matching.(*taskQueueDB).RenewLease\n\tgo.temporal.io/[email protected]/service/matching/db.go:109\ngo.temporal.io/server/service/matching.(*taskWriter).renewLeaseWithRetry.func1\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:302\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/[email protected]/common/backoff/retry.go:199\ngo.temporal.io/server/service/matching.(*taskWriter).renewLeaseWithRetry\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:306\ngo.temporal.io/server/service/matching.(*taskWriter).initReadWriteState\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:131\ngo.temporal.io/server/service/matching.(*taskWriter).taskWriterLoop\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:221\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\tgo.temporal.io/[email protected]/internal/goro/goro.go:64"}
Posting some context from @yiminc:
Note that if the last database connection in the pool closes, the in-memory database is deleted. Make sure the max idle connection limit is > 0, and the connection lifetime is infinite.
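To make that concrete, here is a minimal Go sketch (not Temporal's actual setup) of the pool settings described above. The modernc.org/sqlite driver import and the shared-cache in-memory DSN are assumptions chosen for illustration:

```go
// Minimal sketch: keep an in-memory SQLite database alive via database/sql
// pool settings. The driver and DSN here are illustrative assumptions.
package main

import (
	"database/sql"
	"log"

	_ "modernc.org/sqlite" // pure-Go SQLite driver (the cznic driver linked later in this thread)
)

func main() {
	// A shared-cache in-memory database lives only as long as at least one
	// connection to it stays open.
	db, err := sql.Open("sqlite", "file::memory:?cache=shared")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep at least one idle connection and never expire connections by age
	// or idle time. If the pool ever drops to zero open connections, SQLite
	// deletes the in-memory database and later queries fail with
	// "no such table: ...".
	db.SetMaxIdleConns(1)    // must be > 0
	db.SetConnMaxLifetime(0) // 0 = no maximum connection lifetime
	db.SetConnMaxIdleTime(0) // 0 = never close connections for being idle

	if _, err := db.Exec(`CREATE TABLE cluster_metadata_info (data TEXT)`); err != nil {
		log.Fatal(err)
	}
	var n int
	if err := db.QueryRow(`SELECT COUNT(*) FROM cluster_metadata_info`).Scan(&n); err != nil {
		log.Fatal(err) // with the settings above this should not fail after idle periods
	}
	log.Printf("rows: %d", n)
}
```

If the idle limit were 0, or a finite lifetime let the last open connection expire during a quiet period, the in-memory database and every table in it would disappear, which is consistent with the "no such table" errors above.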
I have a similar issue:
error while fetching cluster metadata: operation GetClusterMetadata encountered table cluster_metadata_info does not exist
Linking for visibility: https://github.com/temporalio/cli/issues/124
I've been observing multiple flakes of this error message in the TS SDK's integration tests recently: 11 times in the last 3 weeks, versus none before that (as far as I can see in the GHA logs).
In those CI jobs, it only happens with the CLI Dev Server started at the GHA job level (i.e. not with Dev Server instances started through the SDK's built-in TestWorkflowEnvironment), using CLI 0.12.0 and 0.13.2. Interestingly, in 9 of the 11 cases, the error started at almost the same point during the run, in the "Worker Lifecycle" tests.
I have modified the CI workflow to retain the server's logs on failure, so hopefully I'll be able to provide more data on this soon.
We have pushed a fix to the SQLite driver: https://gitlab.com/cznic/sqlite/-/merge_requests/74. The next temporal release (v1.26) will include this fix: https://github.com/temporalio/temporal/pull/6836
@prathyushpv We are still experiencing this regularly on 1.27. We deploy small testing environments that run temporal containers in docker-compose in development mode (so using local sqlite databases). In these environments we regularly get spammed with log messages saying "Operation failed with internal error." error="ListNamespaces operation failed. Failed to get namespace rows. Error: SQL logic error: no such table: namespaces (1)"