storage: compactor cannot recover from ObjectIoError

Open hzxa21 opened this issue 3 years ago • 7 comments

Describe the bug
After running a simple materialized source workload for a while, compaction fails due to ObjectStoreIOError in the compactor, and the compactor fails to re-register itself with the meta node, causing compaction to stop.

2022-08-02T05:16:20.288023Z  WARN risingwave_storage::hummock::compactor: Compaction task 16362 failed with error: ObjectStore failed with IO error error reading a body from connection: Connection reset by peer (os error 104).
  backtrace of `TracedHummockError`:
   0: <risingwave_storage::hummock::error::HummockError as core::convert::From<risingwave_storage::hummock::error::HummockErrorInner>>::from
             at ./ebs/risingwave/src/storage/src/hummock/error.rs:55:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/convert/mod.rs:550:9
   2: risingwave_storage::hummock::error::HummockError::object_io_error
             at ./ebs/risingwave/src/storage/src/hummock/error.rs:65:9
   3: core::ops::function::FnOnce::call_once
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/ops/function.rs:248:5
   4: core::result::Result<T,E>::map_err
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/result.rs:855:27
   5: risingwave_storage::hummock::sstable_store::SstableStore::load_table::{{closure}}::{{closure}}::{{closure}}
             at ./ebs/risingwave/src/storage/src/hummock/sstable_store.rs:295:46
  ...


2022-08-02T05:16:25.118053Z  WARN risingwave_storage::hummock::compactor: Compaction task 16535 failed with error: Other error meta cache lookup request dedup get cancel: RecvError(()).
  backtrace of `TracedHummockError`:
   0: <risingwave_storage::hummock::error::HummockError as core::convert::From<risingwave_storage::hummock::error::HummockErrorInner>>::from
             at ./ebs/risingwave/src/storage/src/hummock/error.rs:55:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/convert/mod.rs:550:9
   2: risingwave_storage::hummock::error::HummockError::other
             at ./ebs/risingwave/src/storage/src/hummock/error.rs:117:9
   3: risingwave_storage::hummock::sstable_store::SstableStore::load_table::{{closure}}::{{closure}}
             at ./ebs/risingwave/src/storage/src/hummock/sstable_store.rs:316:21
  ...


2022-08-02T05:16:55.603589Z  WARN risingwave_rpc_client::meta_client: Failed to send_heartbeat: error gRPC status: status: Internal, message: "Unknown worker", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 02 Aug 2022 05:16:54 GMT", "content-length": "0"} }
2022-08-02T05:16:56.604109Z  WARN risingwave_rpc_client::meta_client: Failed to send_heartbeat: error gRPC status: status: Internal, message: "Unknown worker", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 02 Aug 2022 05:16:56 GMT", "content-length": "0"} }
...
2022-08-02T05:17:54.604136Z  WARN risingwave_storage::hummock::compactor: Subscribing to compaction tasks failed with error: gRPC status: status: Internal, message: "Error while interact with meta service: invalid hummock context 5", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 02 Aug 2022 05:17:53 GMT", "content-length": "0"} }. Will retry.
...

To Reproduce
Run a simple 3-node bench with the following workload:

create materialized source s1 (v1 int, v2 float) with (
    connector = 'datagen',
    fields.v1.min = '1',
    fields.v1.max  = '100000',
    fields.v2.min = '1',
    fields.v2.max = '100000',
    datagen.rows.per.second='300000',
    datagen.split.num = '12'
) row format json;

The issue happened after the workload had been running for ~30 minutes.

Expected behavior
Compactor recovers from the error and compaction continues.

Additional context
compactor-log.txt meta-log.txt

hzxa21 avatar Aug 02 '22 06:08 hzxa21

I also witnessed a weird S3 error failed to construct request: NoRegion after restarting the EC2 instance as well as the Docker container for the compactor:

2022-08-02T11:41:39.979530Z  WARN risingwave_storage::hummock::compactor: Compaction task 18673 failed with error: ObjectStore failed with IO error failed to construct request: NoRegion.
  backtrace of `TracedHummockError`:
   0: <risingwave_storage::hummock::error::HummockError as core::convert::From<risingwave_storage::hummock::error::HummockErrorInner>>::from
             at ./ebs/risingwave/src/storage/src/hummock/error.rs:55:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/convert/mod.rs:550:9
   2: risingwave_storage::hummock::error::HummockError::object_io_error
             at ./ebs/risingwave/src/storage/src/hummock/error.rs:65:9
   3: core::ops::function::FnOnce::call_once
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/ops/function.rs:248:5
   4: core::result::Result<T,E>::map_err
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/result.rs:855:27
   5: risingwave_storage::hummock::sstable_store::SstableStore::load_table::{{closure}}::{{closure}}::{{closure}}
             at ./ebs/risingwave/src/storage/src/hummock/sstable_store.rs:286:43
   6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
   7: risingwave_common::cache::LruCache<K,T>::lookup_with_request_dedup::{{closure}}::{{closure}}
             at ./ebs/risingwave/src/common/src/cache.rs:786:38
   8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19

Not sure whether there is any misconfiguration.

hzxa21 avatar Aug 02 '22 12:08 hzxa21

It's somewhat strange that we can't recover. Technically, we should already be able to recover: if a task fails, it is marked as failed, and meta reschedules compaction of the freed SSTs as another task.

It's very possible that this is outside the scope of failure recovery, i.e. the error is one that recurs outside of Hummock's boundaries.

jon-chuang avatar Aug 08 '22 08:08 jon-chuang

Anyway, according to the logs, we do indeed recover eventually; it's just that we spend a long time encountering network-related errors.

jon-chuang avatar Aug 08 '22 14:08 jon-chuang

Also, it’s weird that meta logs only report vacuum tasks, while compactor logs only report compaction tasks. IMO they should each report both.

it’s also weird that meta logs start saying compaction task not found at some point.

Also, does the compactor disconnect and then re-register itself with meta? Or is a new compactor started?

jon-chuang avatar Aug 08 '22 14:08 jon-chuang

Anyway, according to the logs, we do indeed recover eventually; it's just that we spend a long time encountering network-related errors.

The compactor recovers from network errors automatically, but it fails to re-register itself with meta. Compaction is still stuck due to "no available compactor", so technically the cluster is still in a bad state.

hzxa21 avatar Aug 08 '22 15:08 hzxa21

Also, it’s weird that meta logs only report vacuum tasks, while compactor logs only report compaction tasks. IMO they should each report both.

it’s also weird that meta logs start saying compaction task not found at some point.

Also, does the compactor disconnect and then re-register itself with meta? Or is a new compactor started?

The logs show that the compactor disconnects and does not re-register itself with meta.

hzxa21 avatar Aug 08 '22 15:08 hzxa21

it’s also weird that meta logs start saying compaction task not found at some point.

  1. Meta thinks the compactor node is dead, removes it, and unassigns its compaction tasks.
  2. The compactor finishes and reports the task.
  3. Meta rejects the report for the now-nonexistent task, aka "compaction task not found".
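
In other words, the "compaction task not found" reply in step 3 is just a failed lookup against meta's table of assigned tasks, which was cleared in step 1. A minimal sketch of that sequence, with hypothetical names and types rather than the actual meta-node code:

use std::collections::HashMap;

// Hypothetical state: meta only keeps tasks for workers it still considers
// alive, so a report from an expired worker can no longer be matched.
struct MetaState {
    // task_id -> worker_id for currently assigned tasks
    assigned_tasks: HashMap<u64, u32>,
}

impl MetaState {
    // Step 1: the worker expired, so drop all of its task assignments.
    fn remove_expired_worker(&mut self, worker_id: u32) {
        self.assigned_tasks.retain(|_, w| *w != worker_id);
    }

    // Steps 2-3: the compactor still reports the finished task, but the entry
    // is gone, so meta answers with "compaction task not found".
    fn report_compact_task(&mut self, task_id: u64) -> Result<(), String> {
        match self.assigned_tasks.remove(&task_id) {
            Some(_) => Ok(()),
            None => Err(format!("compaction task {task_id} not found")),
        }
    }
}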

zwang28 avatar Aug 10 '22 06:08 zwang28

Hello

After various experiments with intentionally caused errors, I think I now have a better understanding of what happens. That said, I do not know the ultimate root cause. The main problem is that the compactor's network connections did not work as intended. I do not know why, but I am confident that this caused all the errors mentioned above.

Problems Mentioned Above

Let us go through the errors. [C] denotes errors in the compactor's log and [M] errors in the meta node's log. Some error messages have changed a bit since the issue was posted but are still the same in essence.

  • [C] Failed to send_heartbeat: timeout deadline has elapsed Means exactly that. Eventually, the meta node considers the compactor as expired.

  • [C] Failed to send_heartbeat: error gRPC status: status: Internal, message: "Unknown worker" [...] This happens after the compactor expired. It sends heartbeats again, but they are not accepted by the meta node.

  • [C] Compaction task 16362 failed with error: ObjectStore failed [...] Connection reset by peer (os error 104). It appears that the connection to the S3 storage was ended by the storage. Currently, the compactor will report an unsuccessful compaction to the meta node which will then reschedule that compaction.

  • [C] Subscribing to compaction tasks failed with error: [...] Will retry. After the compactor has expired, the meta node no longer accepts requests for new compaction jobs from that compactor.

  • [M] Deleted expired worker [...] After 60 seconds without heartbeat, the meta node considers workers as expired and removes them.

  • [M] No compactor is available. and No vacuum worker is available. Since we only use one compactor node and it expired, the meta node has no node to which it can assign compaction and vacuum jobs.

  • [M] Compaction task 17398 not found It can happen that the compactor finishes a compaction task after it expired. It will still report the completion of that task to the meta node. The meta node, however, has already deleted all tasks assigned to that compactor (due to it expiring) and is therefore unable to find the task.
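
Most of these errors follow from a single state transition: after 60 seconds without a heartbeat the worker is deleted, and every later RPC from it fails the worker lookup. A minimal sketch of that transition, using a hypothetical in-memory registry rather than the real meta-node API:

use std::collections::HashMap;
use std::time::{Duration, Instant};

const MAX_HEARTBEAT_GAP: Duration = Duration::from_secs(60);

// Hypothetical worker registry: once an entry is removed, heartbeats and
// compaction-task subscriptions from that worker ID are rejected.
struct WorkerRegistry {
    last_heartbeat: HashMap<u32, Instant>,
}

impl WorkerRegistry {
    fn heartbeat(&mut self, worker_id: u32) -> Result<(), &'static str> {
        match self.last_heartbeat.get_mut(&worker_id) {
            Some(ts) => {
                *ts = Instant::now();
                Ok(())
            }
            // This is the "Unknown worker" case seen in the compactor log.
            None => Err("Unknown worker"),
        }
    }

    // Periodic sweep: delete workers whose last heartbeat is too old
    // ("Deleted expired worker" in the meta log).
    fn sweep_expired(&mut self, now: Instant) {
        self.last_heartbeat
            .retain(|_, ts| now.duration_since(*ts) < MAX_HEARTBEAT_GAP);
    }
}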

I tried the following things (and combinations of them) and they all allow a recovery:

  • Not sending any heartbeat for less than 60 seconds: Nothing of interest happens here. It is not even noticed in the meta node's log.

  • Error during compaction: The compaction task finishes without a result. Later tasks assign the same tables again. Once compaction tasks no longer fail, there is a complete recovery. Nothing to see in the meta node's log.

  • Very slow compaction task: In general, long compaction tasks are not a problem as long as they finish eventually. It can happen that the meta node eventually emits the warning No idle compactor available. if too many tasks take too long, or that compaction tasks fail due to subsequent errors. However, even if it takes minutes to finish compaction tasks, the compactor will continue working.

In the end, the compactor was unable to recover only if the meta node considered it expired (i.e., no heartbeat was received for 60 seconds), regardless of the reason. This causes the compactor to panic and shut down. Even if panicking is deactivated, it is unable to reconnect.

What I am currently unable to try is a real loss of connection. That is, cutting the wire. I am happy to hear suggestions on how to do that.

The latest commit I used for testing is 2c75b12c790a94e771edeeac702314a6533da596.

Suggestions to Allow Recovery

If a worker expires due to missing heartbeats, it gets deleted from the pool of workers. Later attempts to send heartbeats or to subscribe to compaction tasks then fail, because the meta node cannot find a worker with the corresponding ID. I see two options to recover:

  1. Deactivate, but do not delete: I would suggest that, instead of completely deleting them, we only flag expired workers as such. We still remove them from subscription lists and remove their tasks from the list of submitted tasks, but we do not completely forget about them. Once the meta node receives a heartbeat again, we mark the corresponding worker as running. I did some early testing on this, and it seems to work well (a rough sketch follows after this list).
  2. Completely restart worker: Once the worker recognises that it was expired, it restarts itself completely. This is a client-side approach and seems more complicated to me. It probably needs to be implemented individually for each type of worker. I did not try it yet.
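
To make option 1 concrete, here is a rough sketch of what "deactivate, but do not delete" could look like, assuming a simple in-memory worker pool with hypothetical names (not the actual meta-node data structures):

use std::collections::HashMap;
use std::time::Instant;

enum WorkerState {
    Running,
    Inactive,
}

struct Worker {
    state: WorkerState,
    last_heartbeat: Instant,
}

struct WorkerPool {
    workers: HashMap<u32, Worker>,
}

impl WorkerPool {
    // Instead of removing the expired worker, only flag it. Its subscriptions
    // and assigned tasks would still be cleaned up elsewhere.
    fn expire(&mut self, worker_id: u32) {
        if let Some(w) = self.workers.get_mut(&worker_id) {
            w.state = WorkerState::Inactive;
        }
    }

    // A heartbeat from a known-but-inactive worker reactivates it, so the
    // compactor keeps its worker ID and never has to re-register.
    fn heartbeat(&mut self, worker_id: u32) -> Result<(), &'static str> {
        match self.workers.get_mut(&worker_id) {
            Some(w) => {
                w.last_heartbeat = Instant::now();
                w.state = WorkerState::Running;
                Ok(())
            }
            None => Err("Unknown worker"),
        }
    }
}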

ALeitert avatar Aug 19 '22 04:08 ALeitert

@ALeitert We have been investigating the heartbeat timeout issue recently and found that it is related to EC2 network credits. In other words, we think that because the compactor is under heavy load reading from and writing to S3, the network credits of the single compactor EC2 instance are used up. This leads to degraded network performance, causing heartbeat RPC packets to queue up or be dropped.

You can see the discussion in this thread.

We have a fix (#4633) that limits the number of concurrent compaction tasks in a single compactor to avoid stressing it. There are still things we should do but haven't done yet (a rough sketch of the concurrency-limit idea follows after the list):

  • Use more granular indicators for compactor load in addition to number of concurrent compaction tasks.
  • Backpressure writes into storage when compaction cannot keep up.
  • Auto restart dead compactor worker (via docker/k8s after the pod healthcheck fails)
  • Scale out compactor if we don't have enough compaction resources.
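
For illustration, the concurrency limit could look roughly like the sketch below, which bounds in-flight compaction tasks with a Tokio semaphore. The limit value and the compact() helper are made up for the example; this is not the code from #4633.

use std::sync::Arc;
use tokio::sync::Semaphore;

// Bound the number of compaction tasks a single compactor runs at once so it
// does not saturate the instance's network bandwidth (illustrative limit).
const MAX_CONCURRENT_COMPACTIONS: usize = 4;

async fn run_compactions(tasks: Vec<u64>) {
    let permits = Arc::new(Semaphore::new(MAX_CONCURRENT_COMPACTIONS));
    let mut handles = Vec::new();

    for task_id in tasks {
        let permits = permits.clone();
        handles.push(tokio::spawn(async move {
            // Wait for a free slot before starting; extra tasks queue up here
            // instead of all hitting S3 at the same time.
            let _permit = permits.acquire_owned().await.expect("semaphore closed");
            compact(task_id).await;
        }));
    }

    for handle in handles {
        let _ = handle.await;
    }
}

// Placeholder for the actual compaction work (read SSTs, merge, upload).
async fn compact(task_id: u64) {
    println!("compacting task {task_id}");
}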

hzxa21 avatar Aug 19 '22 05:08 hzxa21

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

github-actions[bot] avatar Oct 31 '22 02:10 github-actions[bot]

Looks like it could be closed?

fuyufjh avatar Oct 31 '22 02:10 fuyufjh