Fix various memory corruption and unwanted behavior scenarios
This PR fixes the following issues:
- Fix possible memory corruption when using upstream requests
- Respect client_max_body_size for incoming websocket messages
- Fix memory corruption in is_utf8() when the block being checked is >16KB
- Fix file-cached messages (over 16KB) not getting passed to upstream

-----(After 2020-11-14)-----
- Fix crash when using X-Accel-Redirect with message forwarding

-----(After 2020-12-05)-----
- Fix crash when multi-id pubsub is closed by timeout
- Fix crash when multi-id pubsub is closed by HTTP DELETE
- Fix crash when forwarding used without proxy_pass
1. Fix possible memory corruption when using upstream requests
There are several possible configuration scenarios where using nchan_publisher_upstream_request can cause memory corruption and undefined behavior. This is due to sanity checks being performed (and wrongly succeeding) on a variable which is never assigned.
Example
Consider this stack trace:
At frame 6:
The ws_publish_data_t struct is allocated here without clearing memory (line 662). Thus, (ws_publish_data_t)->subrequest refers to an undefined memory location (line 690). It is meant to be assigned here after the subrequest setup is complete, and not referenced until then.
This is normally the case, as the subrequest gets 'queued' and we go all the way back down to frame 2 before the subrequest is actually executed and finalized.
At frame 13:
However, if for whatever reason subrequest setup fails, nginx finalizes the subrequest early.
For instance, here the setup fails because the request is too large (line 989).
At frame 15:
Since the request is being finalized, NChan's subrequest handler gets called, expecting the upstream subrequest to have been successfully queued and (ws_publish_data_t)->subrequest assigned.
But we're still in the same event, which will only assign (ws_publish_data_t)->subrequest once we get back down to frame 6.
At frame 16:
A sanity check assertion (line 575) can wrongly succeed here due to (ws_publish_data_t)->subrequest having an undefined value. Thus, past this check we can get undefined behavior and memory corruption.
At frame 17:
We'll most likely segfault here from trying to dereference (ws_publish_data_t)->subrequest, but we can also corrupt memory and/or cause some other undefined behavior instead!
Fix
Clear (ws_publish_data_t)->subrequest right after creating the ws_publish_data_t structure, so that the subsequent sanity checks will correctly interrupt execution on error.
============================================================================
2. Respect client_max_body_size for incoming websocket messages
The client_max_body_size directive was not respected for incoming websocket messages. This causes potential DDoS problems and, in conjunction with nchan_publisher_upstream_request, a possible crash. (It also caused possible memory corruption, but that has been fixed above.)
If nchan_publisher_upstream_request is not being used, this is not a very serious issue, as the message length is limited by memory limits and the worst that can happen is flooding a channel group's memory limits.
However, with nchan_publisher_upstream_request, this can be used as a potential DDoS vector and to crash the nginx server. The former is because nginx writes temporary files for any message exceeding 16KB. An attacker can use this to exhaust disk space on the host system with no way to set limits of any kind, since the incoming websocket message is always accepted regardless of the active client_max_body_size setting.
Additionally, if the incoming message exceeds the client_max_body_size setting, upstream subrequest forwarding will fail, because at that point nginx rejects the subrequest. This causes NChan to bring down the server worker with the error message nginx: /nchan-1.2.7/src/subscribers/websocket.c:575: websocket_publish_upstream_handler: Assertion 'd->subrequest' failed. (Without the other fix above, this can instead cause memory corruption and undefined behavior.)
Fix:
Check the configured client_max_body_size upon receiving a message for publishing, and close the socket if the message exceeds the active setting (see the sketch after the list below). This is currently done in 3 places:
- Upon receiving any websocket packet with WEBSOCKET_OPCODE_TEXT or WEBSOCKET_OPCODE_BINARY
- Upon joining the received message so far
- While inflating the final received message (to prevent 'decompression-bombs')
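The sketch below shows the general shape of such a check against nginx's core location configuration; the function name, its placement, and the surrounding structure are assumptions for illustration, not the actual nchan code:

```c
/* Illustrative sketch -- names and placement are assumptions, not nchan source.
 * The same kind of check is applied at the three points listed above. */
static ngx_int_t
ws_check_max_body_size(ngx_http_request_t *r, off_t accumulated_len)
{
    ngx_http_core_loc_conf_t  *clcf;

    clcf = ngx_http_get_module_loc_conf(r, ngx_http_core_module);

    /* client_max_body_size == 0 means "unlimited" in nginx */
    if (clcf->client_max_body_size != 0
        && accumulated_len > clcf->client_max_body_size)
    {
        /* message too large: the caller closes the websocket */
        return NGX_ERROR;
    }

    return NGX_OK;
}
```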
============================================================================
3. Fix memory corruption in is_utf8() when the block being checked is >16KB
Blocks larger than 16KB get written to a temporary file. When checking such a block, an incorrect memory region was getting unmapped due to incorrect pointer handling. This manifested in occasional segfaults.
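The general failure mode, sketched below with hypothetical names (the actual is_utf8() logic in nchan is more involved): if the cursor returned by mmap() is advanced during the scan and then handed to munmap(), the call targets the wrong region (or fails) instead of releasing the mapping that was actually created:

```c
/* Illustrative sketch with hypothetical names -- not the actual nchan code.
 * munmap() must be given the address originally returned by mmap(), with the
 * original length, not a pointer that was advanced while scanning. */
#include <sys/mman.h>
#include <stddef.h>

static int
check_mapped_block_is_utf8(int fd, size_t len)
{
    unsigned char  *base, *p, *end;
    int             ok = 1;

    base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        return -1;
    }

    p = base;
    end = base + len;

    while (p < end) {
        /* ... UTF-8 validation that advances p ... */
        p++;
    }

    /* Bug of the kind fixed here: calling munmap(p, len) would point past the
     * mapping instead of at it. Always unmap the original base. */
    munmap(base, len);

    return ok;
}
```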
============================================================================
4. Fix file-cached messages (over 16KB) not getting passed to upstream.
When using nchan_publisher_upstream_request, requests over 16KB get stored to temporary files for processing. These were neither getting passed upstream nor deleted after processing, because (ngx_buf_t)->memory was getting set even if the buffer was file-mapped.
This resulted in the text [alert] 27766#27766: *1 zero size buf in output in the nginx error log, the message not being passed upstream, temporary files remaining on disk forever, and nginx workers' virtual memory size growing since the file-mapped memory regions were not getting unmapped.
Fix:
The offending line was commented out. (ngx_buf_t)->memory is already set for buffers that are actually in memory, so I don't see why this flag was getting set manually here. Was there some reason?
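For context, here is a sketch of how an ngx_buf_t for a file-backed message is typically set up with the standard nginx buffer flags; the file and file_size variables are assumed for illustration, and the actual nchan call site differs:

```c
/* Sketch of a file-backed (temp-file) buffer using standard ngx_buf_t flags.
 * 'file' (an ngx_file_t *) and 'file_size' are assumed variables. */
ngx_buf_t  *b;

b = ngx_calloc_buf(r->pool);
if (b == NULL) {
    return NGX_ERROR;
}

b->in_file   = 1;          /* contents live in a file, not in memory */
b->file      = file;
b->file_pos  = 0;
b->file_last = file_size;

/* b->memory should only be set when the buffer's data really is in memory;
 * setting it on a file-only buffer makes the output chain see a
 * "zero size buf", which is the symptom described above. */
```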
============================================================================
Note
This is part of a series of patches to resolve issue #588.
:+1: LGTM
5. Fix crash when using X-Accel-Redirect with message forwarding
When using X-Accel-Redirect to abstract a websocket pub/sub endpoint from the end client, combined with upstream message forwarding via nchan_publisher_upstream_request, the nginx worker thread would very often segfault after receiving and trying to broadcast the 2nd message after channel creation. In the rare cases when this didn't happen, it would often segfault on websocket disconnect instead. This was due to memory corruption.
This behavior could be mitigated by placing charset off directives in the nginx configuration file. However, this never completely eliminated the problem in my tests.
Here's roughly how this would occur:
First request:
- Soon after receiving a response from upstream (in ngx_http_upstream_process_header()), a charset_filter_module context gets allocated in ngx_http_main_request_charset() on the current memory pool (with ngx_pcalloc) and set for the request structure: ngx_http_set_ctx(r->main, ctx, ngx_http_charset_filter_module); in ngx_http_charset_filter_module.c, line 409.
- Later in the request's processing pipeline, ngx_http_charset_body_filter() (ngx_http_charset_filter_module.c, line 547) is called multiple times and uses the context set above.
- I assume the above-mentioned memory pool is then freed upon completion of the upstream request.

Second request:
- The request structure gets reused. For some reason its charset_filter_module context is still set to the value from step 1.
- ngx_http_main_request_charset() is called, but does nothing, because it sees the request's charset_filter_module context is already set.
- Somewhere in between, the freed memory pointed to by the current request structure's charset_filter_module context gets overwritten.
- Later in the request's processing pipeline, ngx_http_charset_body_filter() is again called multiple times. It attempts to use the now-clobbered charset_filter_module context -- segfault!
Fix
Clear the current request structure's charset_filter_module context upon completion of the request (after completely receiving, storing, and broadcasting a message). There may be a better way to do this. (This will most likely not be an issue at all in Nchan2).
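A minimal sketch of the kind of call involved, using nginx's standard per-request module-context macro (exactly where nchan hooks this in is the open question above):

```c
/* Sketch: reset the charset filter's per-request context so a reused request
 * structure cannot see a stale pointer into a freed pool.
 * ngx_http_set_ctx() is the standard nginx macro for storing module contexts;
 * it needs the ngx_http_charset_filter_module declaration to be visible. */
ngx_http_set_ctx(r->main, NULL, ngx_http_charset_filter_module);
```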
============================================================================
6-7. Fix crash when multi-id pubsub is closed by timeout or HTTP DELETE
When using multi-id pubsub endpoints, a variety of problems occurred upon the closing of the pubsub endpoint and/or its channels.
Example configuration:

```
nchan_pubsub websocket;
nchan_channel_id 1 2;
nchan_subscriber_timeout 1m;
```
Consequences:
- Regardless of how the endpoint was closed:
  - Channel never gets deleted. Debug log says: MEMSTORE:00: not ready to reap m/ /1 /2 : status 2
- If the endpoint was closed by nchan_subscriber_timeout triggering or one of the channel IDs receiving an HTTP DELETE:
  - Channel never gets deleted.
  - Error log says: MEMSTORE:04: tried adding WAITING chanhead 00BDDB08 m/ /1 /2 to chanhead_gc. why?
  - Messages sometimes stop arriving. Reconnecting the websocket sometimes temporarily fixes this.
  - Eventually the worker thread crashes with a segfault.
Fix
This line sub->dequeue_after_response = 1 appears to have caused closing procedures to happen out of order. Commenting it out completely fixes this behavior. Once the websocket is closed, the pub/sub endpoints are properly dequeued and deleted, as are the target channels (once there are no pub/subs left).
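For clarity, the change amounts to no longer setting that flag at the problematic call site; the surrounding nchan code is not shown here:

```c
/* Don't force the subscriber to dequeue right after the response; letting the
 * normal websocket close path run keeps the teardown steps in order. */
/* sub->dequeue_after_response = 1; */
```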
============================================================================
8. Fix crash when forwarding used without proxy_pass
This directly addresses issue #588. It fixes the crash caused by this scenario, but does not enable its full functionality - message modification, in particular, doesn't work. Since there is an acceptable workaround, I will not attempt to fix this further - at least it's no longer possible to crash the worker thread with this scenario.
Fix
This simply populates the (ws_publish_data_t)->subrequest early to allow the sequence of events to complete on the same worker cycle without crashing. Basic upstream responses without message modification work fine.
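A rough sketch of what populating it early could look like with nginx's standard subrequest API; the variable names (upstream_uri, ps, d) and the NGX_HTTP_SUBREQUEST_IN_MEMORY flag are assumptions for illustration, not the actual nchan call:

```c
/* Sketch only -- names and flags are assumptions, not the actual nchan code.
 * ngx_http_subrequest() fills in *sr before returning, so the pointer can be
 * stored on the publish data immediately, within the same worker cycle,
 * instead of waiting for a later callback. */
ngx_http_request_t  *sr;

if (ngx_http_subrequest(r, &upstream_uri, NULL, &sr, ps,
                        NGX_HTTP_SUBREQUEST_IN_MEMORY)
    != NGX_OK)
{
    return NGX_ERROR;
}

d->subrequest = sr;   /* populate early, before the subrequest actually runs */
```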
============================================================================
This concludes my series of fixes for Nchan.
Let's get this merged indeed! There's a lot to regression-test here. I will report my findings once the testing is complete.
@sobitcorp please email me at leo (at) nchan.io , i'd like to talk to you a little more about this PR
Any news about this PR? This seems to fix all the issues I'm currently facing in my production environment!
@aliqandil slact is currently regression-testing this, afaik.
@jilles-sg Good catch, I fixed the uninitialized var. I also agree with your proposed change to the calling convention. I simply didn't want to make any major calling convention changes, as I'm not too deeply familiar with nchan internals, so I just tacked a result flag at the end as a quick fix.
Yup. The sub->dequeue_after_response = 1 and the charset_filter_module fixes aren't passing regression testing, and I haven't had time to come up with a fix. I'm leaning towards merging the other stuff in the meantime and leaving these two for later.
Just checking in to see if there are any updates on this pull request. We are still seeing lots of segfaults and general protection faults on our instances, and it is beginning to become a problem under load. We think at least some of these segfaults are related to bugs this pull request is aiming to fix.
@slact I know you are busy on the new nchan (which is great, really appreciate your work!), but is there any chance these fixes (or at least the ones that passed regression testing) could be merged in an upcoming release, along with the recent commit fixing the segfault during multi-publishing? It would greatly help us maintain stability, at least until the new nchan is released (as we do not really have an alternative currently).
If we can help in any way with this testing or debugging, please let me know as well. Happy to help in any way possible.
We are also very much looking forward to this, after seeing nchan come to a halt on a traffic spike and not recover (service was running, but wouldn't accept new connections even though CPU was back down to 0).