
SQLite failed due to missing DB table

Open mindaugasrukas opened this issue 2 years ago • 5 comments

Yesterday I launched the single-process development server: temporal server start-dev. Usually I keep it running for a couple of days without any issues, but today I got this HTTP 503 response from the web UI:

{"statusCode":503,"statusText":"Service Unavailable","response":{},"message":"GetClusterMetadata operation failed. Error: SQL logic error: no such table: cluster_metadata_info (1)"}

So I had to restart the process. I'm still trying to figure out how to reproduce this, or whether it's a real issue, so I'm leaving this here for the record in case it repeats or helps us better understand the problem.

Some log snippets:

{"level":"error","ts":"2023-01-06T10:01:33.589-0800","msg":"Operation failed with internal error.","error":"GetMetadata operation failed. Error: SQL logic error: no such table: namespace_metadata (1)","metric-scope":55,"logging-call-at":"persistenceMetricClients.go:1579","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\tgo.temporal.io/[email protected]/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1579\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).GetMetadata\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:908\ngo.temporal.io/server/common/persistence.(*metadataRetryablePersistenceClient).GetMetadata.func1\n\tgo.temporal.io/[email protected]/common/persistence/persistenceRetryableClients.go:901\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/[email protected]/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*metadataRetryablePersistenceClient).GetMetadata\n\tgo.temporal.io/[email protected]/common/persistence/persistenceRetryableClients.go:905\ngo.temporal.io/server/common/namespace.(*registry).refreshNamespaces\n\tgo.temporal.io/[email protected]/common/namespace/registry.go:426\ngo.temporal.io/server/common/namespace.(*registry).refreshLoop\n\tgo.temporal.io/[email protected]/common/namespace/registry.go:403\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\tgo.temporal.io/[email protected]/internal/goro/goro.go:64"}
{"level":"error","ts":"2023-01-06T10:01:33.589-0800","msg":"Operation failed with internal error.","error":"GetTaskQueue operation failed. Failed to check if task queue default-worker-tq of type Workflow existed. Error: SQL logic error: no such table: task_queues (1)","metric-scope":41,"logging-call-at":"persistenceMetricClients.go:1579","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\tgo.temporal.io/[email protected]/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1579\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).GetTaskQueue\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:724\ngo.temporal.io/server/service/matching.(*taskQueueDB).takeOverTaskQueueLocked\n\tgo.temporal.io/[email protected]/service/matching/db.go:123\ngo.temporal.io/server/service/matching.(*taskQueueDB).RenewLease\n\tgo.temporal.io/[email protected]/service/matching/db.go:109\ngo.temporal.io/server/service/matching.(*taskWriter).renewLeaseWithRetry.func1\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:302\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/[email protected]/common/backoff/retry.go:194\ngo.temporal.io/server/service/matching.(*taskWriter).renewLeaseWithRetry\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:306\ngo.temporal.io/server/service/matching.(*taskWriter).initReadWriteState\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:131\ngo.temporal.io/server/service/matching.(*taskWriter).taskWriterLoop\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:221\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\tgo.temporal.io/[email protected]/internal/goro/goro.go:64"}

Expected Behavior

The dev server keeps running without errors.

Actual Behavior

The single-process dev server failed because its DB tables went missing.

Steps to Reproduce the Problem

Unknown. I was not able to construct reproducible steps.

What I did initially:

  • % temporal server start-dev
  • do something for a day
  • next day, it fails

Specifications

  • Version:
% temporal -v              
temporal version 0.2.0 (server 1.18.5) (ui 2.9.0)
  • Platform:
% uname -mrs   
Darwin 21.6.0 arm64

mindaugasrukas avatar Jan 06 '23 19:01 mindaugasrukas

The same issue has been reported for temporal version 0.5.0 (server 1.20.0) (ui 2.10.3):

{"level":"error","ts":"2023-02-22T08:56:41.887-0800","msg":"Operation failed with internal error.","error":"ListNamespaces operation failed. Failed to get namespace rows. Error: SQL logic error: no such table: namespaces (1)","operation":"ListNamespaces","logging-call-at":"persistenceMetricClients.go:1171","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\tgo.temporal.io/[email protected]/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1171\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1148\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).ListNamespaces.func1\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:683\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).ListNamespaces\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:685\ngo.temporal.io/server/common/persistence.(*metadataRetryablePersistenceClient).ListNamespaces.func1\n\tgo.temporal.io/[email protected]/common/persistence/persistenceRetryableClients.go:887\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/[email protected]/common/backoff/retry.go:199\ngo.temporal.io/server/common/persistence.(*metadataRetryablePersistenceClient).ListNamespaces\n\tgo.temporal.io/[email protected]/common/persistence/persistenceRetryableClients.go:891\ngo.temporal.io/server/common/namespace.(*registry).refreshNamespaces\n\tgo.temporal.io/[email protected]/common/namespace/registry.go:386\ngo.temporal.io/server/common/namespace.(*registry).refreshLoop\n\tgo.temporal.io/[email protected]/common/namespace/registry.go:357\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\tgo.temporal.io/[email protected]/internal/goro/goro.go:64"} {"level":"error","ts":"2023-02-22T08:56:41.892-0800","msg":"Operation failed with internal error.","error":"GetTaskQueue operation failed. Failed to check if task queue default-worker-tq of type Workflow existed. 
Error: SQL logic error: no such table: task_queues (1)","operation":"GetTaskQueue","logging-call-at":"persistenceMetricClients.go:1171","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\tgo.temporal.io/[email protected]/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1171\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:1148\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).GetTaskQueue.func1\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:567\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).GetTaskQueue\n\tgo.temporal.io/[email protected]/common/persistence/persistenceMetricClients.go:569\ngo.temporal.io/server/service/matching.(*taskQueueDB).takeOverTaskQueueLocked\n\tgo.temporal.io/[email protected]/service/matching/db.go:123\ngo.temporal.io/server/service/matching.(*taskQueueDB).RenewLease\n\tgo.temporal.io/[email protected]/service/matching/db.go:109\ngo.temporal.io/server/service/matching.(*taskWriter).renewLeaseWithRetry.func1\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:302\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/[email protected]/common/backoff/retry.go:199\ngo.temporal.io/server/service/matching.(*taskWriter).renewLeaseWithRetry\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:306\ngo.temporal.io/server/service/matching.(*taskWriter).initReadWriteState\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:131\ngo.temporal.io/server/service/matching.(*taskWriter).taskWriterLoop\n\tgo.temporal.io/[email protected]/service/matching/taskWriter.go:221\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\tgo.temporal.io/[email protected]/internal/goro/goro.go:64"}

mindaugasrukas avatar Feb 27 '23 18:02 mindaugasrukas

Posting some context from @yiminc:

Note that if the last database connection in the pool closes, the in-memory database is deleted. Make sure the max idle connection limit is > 0, and the connection lifetime is infinite.
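
To illustrate the failure mode with a minimal Go sketch (the DSN, module path, and pool values below are illustrative, not Temporal's actual configuration code): with database/sql, a shared in-memory SQLite database lives only as long as at least one pooled connection stays open, so the pool must never be allowed to close its last connection.

// Minimal sketch only — assumed DSN and settings, not Temporal's code.
package main

import (
	"database/sql"
	"log"

	_ "modernc.org/sqlite" // pure-Go SQLite driver (a.k.a. cznic/sqlite)
)

func main() {
	// A shared in-memory SQLite database exists only while at least one
	// connection to it is open; when the last connection closes, every
	// table is discarded, producing "no such table" errors afterwards.
	db, err := sql.Open("sqlite", "file:temporal?mode=memory&cache=shared")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep the pool from ever dropping its last connection:
	db.SetMaxIdleConns(1)    // max idle connections > 0
	db.SetConnMaxLifetime(0) // 0 = connections are never expired
	db.SetConnMaxIdleTime(0) // 0 = idle connections are never closed

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
}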

bergundy avatar Feb 27 '23 19:02 bergundy

I have a similar issue:

error while fetching cluster metadata: operation GetClusterMetadata encountered table cluster_metadata_info does not exist

ThePlenkov avatar Mar 21 '23 11:03 ThePlenkov

Linking for visibility: https://github.com/temporalio/cli/issues/124

mindaugasrukas avatar Jul 13 '23 04:07 mindaugasrukas

I've been observing multiple flakes of this error message in TS SDK's integration tests recently. To be exact, 11 times in the last 3 weeks, vs none before that (as far as I can see in GHA logs).

In the context of those CI jobs, it only happens with the CLI Dev Server started at the GHA job level (i.e. not with Dev Server instances started using the SDK's built-in TestWorkflowEnvironment), using CLI 0.12.0 and 0.13.2. Interestingly, in 9 cases out of 11, the "error" started at almost the same point in the run, during the "Worker Lifecycle" tests.

I have modified the CI workflow to retain the server's logs on failure, so I hope to be able to provide more data on this soon.

mjameswh avatar Jul 12 '24 00:07 mjameswh

We have pushed a fix to the SQLite driver: https://gitlab.com/cznic/sqlite/-/merge_requests/74. The next Temporal release (v1.26) will include this fix: https://github.com/temporalio/temporal/pull/6836

prathyushpv avatar Dec 05 '24 18:12 prathyushpv

@prathyushpv We are still experiencing this regularly on 1.27. We deploy small testing environments that run temporal containers via docker-compose in developer mode (so using local SQLite databases). In these environments we regularly get spammed with log messages saying "Operation failed with internal error." error="ListNamespaces operation failed. Failed to get namespace rows. Error: SQL logic error: no such table: namespaces (1)"
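
For reference, the dev server can also be pointed at an on-disk database file via the CLI's --db-filename flag instead of using the default in-memory database, which avoids the disappearing in-memory tables altogether (the path below is only an example):

% temporal server start-dev --db-filename /srv/temporal/dev.db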

eldondevat avatar Jun 16 '25 14:06 eldondevat