envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Handling WASM failures in onRequestHeaders causes hang

Open vchigrin opened this issue 9 months ago • 8 comments

Description:

  1. Write WASM plugin that causes "VM failure" in onRequestHeaders callback (e.g. by calling std::abort() in case C++ implementation).
  2. Load it in Envoy with failure_policy: FAIL_CLOSED setting in .yaml file.
  3. Use curl to make HTTP request that should go through that plugin

Expected behavior: According to documentation here https://github.com/envoyproxy/envoy/blob/ad400974d8114c9f2da6d96a4acf92e6601d5561/api/envoy/extensions/wasm/v3/wasm.proto#L35-L36 HTTP 503 must be returned

Observed behavior: curl hangs, without returning any response. Observed on 1.33 version of Envoy.

I think that returning HTTP 503 immediately is better choice, since in that case client can have more time to perform some fallback behavior during handling this request.

failure_policy: FAIL_OPEN causes hang only first few requests (seems count is equal to number of worker threads), and after that further requests bypass plugin successfully.

[optional Relevant Links:] Guess related piece of code is that https://github.com/proxy-wasm/proxy-wasm-cpp-host/blob/c4d7bb0fda912e24c64daf2aa749ec54cec99412/src/context.cc#L328

since StopAllIterationAndWatermark documentation says https://github.com/envoyproxy/envoy/blob/ad400974d8114c9f2da6d96a4acf92e6601d5561/envoy/http/filter.h#L100-L109 that continueDecoding() function must be called to resume request processing.

I did not check this hypothesis though.

vchigrin avatar Mar 19 '25 15:03 vchigrin

cc @wbpcode

botengyao avatar Mar 19 '25 19:03 botengyao

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 19 '25 00:04 github-actions[bot]

Sorry for the delay first.

It's weird, as far as I know we have a unit test that cover this case and ensure the sendLocalReply will be called when vm is crashed.

could you provide an example implementation that could trigger this?

Thanks for reporting this anyway.

wbpcode avatar Apr 20 '25 09:04 wbpcode

Yes, certainly. Here is simplest WASM plugin written in C++ to demonstrate problem https://gist.github.com/vchigrin/4ad1f0f1eeb72979131774ad9b20b965 Example envoy configuration https://gist.github.com/vchigrin/ab6ef7bcf9e84686901850b420a5faa2 (assumes plugin in /tmp/wasm_bug_proof.wasm.

If it will be loaded in envoy with say, --concurrency 2 switch in command line, then first two curl request usually provide onRequestHeaders. BEFORE records in log, and thats all. Both first two, and all further requests to envoy will hang without producing HTTP 5xx responses.

Just tested it on envoy-1.34.0-linux-x86_64 binary downloaded from GitHub.

vchigrin avatar Apr 21 '25 08:04 vchigrin

@wbpcode I investigated this issue and made fix https://github.com/proxy-wasm/proxy-wasm-cpp-host/pull/432/files - let me know what do you think? On my test WASM plugin that fixed issue, and Envoy started to answer with HTTP 503 response codes.

Seems, problem was in the following scenario:

  1. Failure during onRequestHeaders callback caused envoy to send local reply.
  2. But at that time it calls again to plugin, now with onResponseHeaders callback.
  3. Since stream_failed_ flag already set, no more local replies sending attempted, but plugin returned StopAllIterationAndWatermark before fix.
  4. That caused proxy to defer this request, so it hanged indefinite long time.

vchigrin avatar May 07 '25 13:05 vchigrin

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 06 '25 16:06 github-actions[bot]

@wbpcode @RyanTheOptimist @botengyao Please, don't close this issue automatically. I provided:

  1. Simple plugin to demonstrate the problem (https://github.com/envoyproxy/envoy/issues/38801#issuecomment-2817952282)
  2. Description about causes of the problem in code (https://github.com/envoyproxy/envoy/issues/38801#issuecomment-2858660954)
  3. PR that fixes this issue (https://github.com/proxy-wasm/proxy-wasm-cpp-host/pull/432/files)

vchigrin avatar Jun 13 '25 12:06 vchigrin

@wbpcode Could I do anything else to help with this issue?

vchigrin avatar Jun 17 '25 12:06 vchigrin

@wbpcode Could I do anything else to help with this issue?

Sorry, I am a little busy and missed this issue. I will take a look this weekend.

wbpcode avatar Jun 27 '25 02:06 wbpcode

Please, don't forget about this issue. There is just one line of code in patch to fix it.

vchigrin avatar Jul 09 '25 08:07 vchigrin

Please, don't forget about this issue. There is just one line of code in patch to fix it.

Sorry so much. I did forget this. After a check to this problem, yeah, it's a bug. Before https://github.com/envoyproxy/envoy/pull/24367, the crash of vm will result in send local reply and the encoder filter chain will executed (before the stream_failed_ is set) and the encodeHeaders and try to send the local reply again and finally result in the Envoy to send local reply directly and skip the filter chain.

But after the https://github.com/envoyproxy/envoy/pull/24367, the send local reply is splited two phases: prepare the local reply and send the local reply actually. That make the encoder filter chain will executed after the stream_failed_ and will result in the request hang.

I also checked your fix. But I am afraid that may not be the correctly way to fix the bug. It should works because the returned status will be convert to stop iteration anyway if local response is sent. But proxy_wasm community is something out of our scope and I will provide a fix at the Envoy directly.

Finally, thanks so much for the resport and the continued track to this issue and sorry again for the delay !!!

wbpcode avatar Jul 09 '25 14:07 wbpcode