quickwit icon indicating copy to clipboard operation
quickwit copied to clipboard

null value in column node_id crashes indexing pipelines after migration issues

Open EthanGeekFan opened this issue 1 year ago • 2 comments

Describe the bug The problem arise after the problem of "migration 17 was previously applied but missing in resolved migrations". When previous migration problem happened, we tried recover from it by deleting all migrations in _sql_migrations where version>=17. The server then starts successfully. However, while ingesting, the null value problem happens and crashes indexing pipeline. Which crashes other services since the ingest queue will fill up quickly after this and reject all ingestion request.

Wanted to ask if there are any ways to recover from this kind of error and not loosing data.

Also wanted to know the cause of the "migration 17 was previously applied but missing in resolved migrations" problem and how to recover from it.

Logs:


2024-08-27T21:43:46.327Z ERROR index-doc-batches{index_id=cerus-logfile-rawline-v1 source_id=_ingest-api-source pipeline_uid=01J6AVPG88MT5G9WRV3NNP6C4Q workbench_id=01J6AWMW6MK06EVTNKHV7KKDE0}:uploader:stage_splits{split_ids="[\"01J6AWMW959T3G44GZVN5R67E4\"]"}: quickwit_metastore::metastore::postgres::error: postgresql-error error=PgDatabaseError { severity: Error, code: "23502", message: "null value in column \"node_id\" of relation \"splits\" violates not-null constraint", detail: Some("Failing row contains (01J6AWMW959T3G44GZVN5R67E4, Staged, 1724794870, 1724795017, {}, {\"version\":\"0.8\",\"split_id\":\"01J6AWMW959T3G44GZVN5R67E4\",\"index_..., cerus-logfile-rawline-v1:01J4T6W0S1KY145TQ898PEFXFY, 2024-08-27 21:43:46.232142, 2024-08-27 21:43:46.232142, null, 0, 2024-08-29 21:43:46, null)."), hint: None, position: None, where: None, schema: Some("public"), table: Some("splits"), column: Some("node_id"), data_type: None, constraint: None, file: Some("execMain.c"), line: Some(1974), routine: Some("ExecConstraints") }
2024-08-27T21:43:46.327Z  WARN index-doc-batches{index_id=cerus-logfile-rawline-v1 source_id=_ingest-api-source pipeline_uid=01J6AVPG88MT5G9WRV3NNP6C4Q workbench_id=01J6AWMW6MK06EVTNKHV7KKDE0}:uploader:stage_splits{split_ids="[\"01J6AWMW959T3G44GZVN5R67E4\"]"}: quickwit_metastore::metastore::postgres::metastore: rollback
2024-08-27T21:43:46.404Z ERROR quickwit_actors::spawn_builder: actor-failure cause=failed to receive command from uploader

Caused by:
   channel closed exit_status=Failure(failed to receive command from uploader

Caused by:
   channel closed)
2024-08-27T21:43:46.404Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-cold-DmVd exit_status=failure(cause=failed to receive command from uploader

Caused by:
   channel closed)
2024-08-27T21:43:46.404Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-cold-DmVd exit_status=Failure(failed to receive command from uploader

Caused by:
   channel closed)
2024-08-27T21:43:46.404Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=Publisher-small-JH9u exit_status=killed
2024-08-27T21:43:46.422Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=Indexer-misty-y1nk exit_status=killed
2024-08-27T21:43:46.422Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::doc_processor::DocProcessor-late-pbxr exit_status=killed
2024-08-27T21:43:46.422Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=SourceActor-icy-uHQU exit_status=killed
2024-08-27T21:43:46.434Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::index_serializer::IndexSerializer-solitary-CV8N exit_status=killed
2024-08-27T21:43:46.434Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=Packager-black-nUJo exit_status=killed
2024-08-27T21:43:46.435Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=IndexUploader-cold-zJfk exit_status=killed
2024-08-27T21:43:46.843Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="SourceActor-icy-uHQU"
2024-08-27T21:43:46.843Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::doc_processor::DocProcessor-late-pbxr"
2024-08-27T21:43:46.843Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Indexer-misty-y1nk"
2024-08-27T21:43:46.843Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::index_serializer::IndexSerializer-solitary-CV8N"
2024-08-27T21:43:46.843Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Packager-black-nUJo"
2024-08-27T21:43:46.843Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="IndexUploader-cold-zJfk"
2024-08-27T21:43:46.843Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-cold-DmVd"
2024-08-27T21:43:46.843Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Publisher-small-JH9u"
2024-08-27T21:43:46.843Z ERROR quickwit_indexing::actors::indexing_pipeline: Indexing pipeline failure. pipeline_id=IndexingPipelineId { node_id: "prime-etl-01", index_uid: IndexUid { index_id: "cerus-logfile-rawline-v1", incarnation_id: Ulid(2083174441638835175660655156021163518) }, source_id: "_ingest-api-source", pipeline_uid: Pipeline(01J6AVPG88MT5G9WRV3NNP6C4Q) } generation=2 healthy_actors=[] failed_or_unhealthy_actors=["SourceActor-icy-uHQU", "quickwit_indexing::actors::doc_processor::DocProcessor-late-pbxr", "Indexer-misty-y1nk", "quickwit_indexing::actors::index_serializer::IndexSerializer-solitary-CV8N", "Packager-black-nUJo", "IndexUploader-cold-zJfk", "quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-cold-DmVd", "Publisher-small-JH9u"] success_actors=[]

Steps to reproduce (if applicable) Steps to reproduce the behavior:

  1. Run one instance of docker quickwit 0.8.2 and another instance of binary quickwit 0.8.0. restarted each for a few times
  2. Either server won't start because of the migration error on postgres
  3. Deleted all migrations >= 17 in postgres using psql
  4. Start one instance of docker quickwit 0.8.2
  5. Ingest data by posting to REST API
  6. Index pipeline crashes and ingestion queue fill up and reject all ingestion requests

Expected behavior The server would reapply necessary migrations and work normally

Configuration: Please provide:

  1. Output of quickwit --version 0.8.2 docker image and 0.8.0 compiled binary

EthanGeekFan avatar Aug 28 '24 21:08 EthanGeekFan

  1. Output of quickwit --version 0.8.2 docker image and 0.8.0 compiled binary

please provide the actual output. main will record as0.8.0 until 0.9 is released.

Quickwit 0.8.0-0.8.2 stop at migration 16. If you have more than that, you ran something else at some point. You should revert migrations 17 to 19|20. You definitely have 19 based on the error you get. You can check if you have 20 by looking at the column doc_mapping_uid from the table shards. If it exists, you have to run the corresponding down migration too. Quickwit doesn't have a builtin way to run the down migrations. You can find them and run them manually from :/quickwit/quickwit-metastore/migrations/postgresql (by copying the statements from the relevant .down.sql files into psql). You'd also need to delete a few entries from _sql_migrations, but by the look of it, you already did.

trinity-1686a avatar Aug 28 '24 21:08 trinity-1686a

Hi,

Thank you so much for this solution. However, this error appears now after downloading migrating the database. I indeed had the 20 migration. And while I ran the down migration commands, I found that ALTER TABLE splits DROP IF EXISTS COLUMN node_id reports syntax error. So I changed them to DROP COLUMN IF EXISTS this is done in every file with this kind of statements. Correct me if I am doing this wrong.

The new error logs: (run by docker image quickwit:0.8.2)


2024-08-29T20:08:07.926Z ERROR index-doc-batches{index_id=cerus-logfile-rawline-v1 source_id=_ingest-api-source pipeline_uid=01J6FV6VNS7T6C6G0HV7CQ7MP1 workbench_id=01J6FVZ4802SA18KNW4SHY1D54}:uploader:stage_splits{split_ids="[\"01J6FVZ4ADV7X7XNKN38HKBXWY\"]"}: quickwit_metastore::metastore::postgres::error: postgresql-error error=PgDatabaseError { severity: Error, code: "42703", message: "record \"new\" has no field \"index_id\"", detail: None, hint: None, position: None, where: Some("SQL statement \"UPDATE indexes SET update_timestamp = NEW.update_timestamp\n        WHERE indexes.index_id = NEW.index_id\"\nPL/pgSQL function set_index_update_timestamp_for_split() line 4 at SQL statement"), schema: None, table: None, column: None, data_type: None, constraint: None, file: Some("pl_exec.c"), line: Some(5601), routine: Some("plpgsql_exec_get_datum_type_info") }
2024-08-29T20:08:07.926Z  WARN index-doc-batches{index_id=cerus-logfile-rawline-v1 source_id=_ingest-api-source pipeline_uid=01J6FV6VNS7T6C6G0HV7CQ7MP1 workbench_id=01J6FVZ4802SA18KNW4SHY1D54}:uploader:stage_splits{split_ids="[\"01J6FVZ4ADV7X7XNKN38HKBXWY\"]"}: quickwit_metastore::metastore::postgres::metastore: rollback
2024-08-29T20:08:08.005Z ERROR quickwit_actors::spawn_builder: actor-failure cause=failed to receive command from uploader

Caused by:
   channel closed exit_status=Failure(failed to receive command from uploader

Caused by:
   channel closed)
2024-08-29T20:08:08.005Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-twilight-jhhc exit_status=failure(cause=failed to receive command from uploader

Caused by:
   channel closed)
2024-08-29T20:08:08.005Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-twilight-jhhc exit_status=Failure(failed to receive command from uploader

Caused by:
   channel closed)
2024-08-29T20:08:08.005Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=Publisher-wandering-WO4g exit_status=killed
2024-08-29T20:08:08.432Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-twilight-jhhc"
2024-08-29T20:08:08.432Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Publisher-wandering-WO4g"
2024-08-29T20:08:08.432Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::doc_processor::DocProcessor-cold-qs6X exit_status=killed
2024-08-29T20:08:08.432Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=Indexer-blue-nSk8 exit_status=killed
2024-08-29T20:08:08.432Z ERROR quickwit_indexing::actors::indexing_pipeline: Indexing pipeline failure. pipeline_id=IndexingPipelineId { node_id: "prime-etl-01", index_uid: IndexUid { index_id: "cerus-logfile-rawline-v1", incarnation_id: Ulid(2083174441638835175660655156021163518) }, source_id: "_ingest-api-source", pipeline_uid: Pipeline(01J6FV6VNS7T6C6G0HV7CQ7MP1) } generation=12 healthy_actors=["SourceActor-bitter-gIXI", "quickwit_indexing::actors::doc_processor::DocProcessor-cold-qs6X", "Indexer-blue-nSk8", "quickwit_indexing::actors::index_serializer::IndexSerializer-dark-4GdD", "Packager-little-hAZN", "IndexUploader-still-QQM3"] failed_or_unhealthy_actors=["quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-twilight-jhhc", "Publisher-wandering-WO4g"] success_actors=[]

The quickwit version output for previous 0.8.0 version compiled from source for centos 7:

Quickwit 0.8.0 (x86_64-unknown-linux-gnu unknown unknown)

Thank you so much for helping with this issue! I really appreciate your help!

EthanGeekFan avatar Aug 29 '24 20:08 EthanGeekFan