core icon indicating copy to clipboard operation
core copied to clipboard

Fix websocket back pressure bottleneck

Open bdraco opened this issue 1 year ago • 5 comments

Proposed change

This is yet another redux of Client unable to keep up with pending messages with some more insight on how it can happen.

Our websocket implementation is backed by an asyncio.Queue

As back-pressure builds, the queue will back up and use more memory until we disconnect the client when the queue size reaches MAX_PENDING_MSG. When we are generating a high volume of websocket messages, we hit a bottleneck in aiohttp where it will wait for the buffer to drain before sending the next message and messages start backing up in the queue.

https://github.com/aio-libs/aiohttp/issues/1367 added drains to the websocket writer to handle malicious clients and network issues. The drain causes multiple problems for us since the buffer cannot be drained fast enough when we deliver a high volume or large messages:

  • We end up disconnecting the client. The client will then reconnect, and the cycle repeats itself, which results in a significant amount of CPU usage.

  • Messages latency increases because messages cannot be moved into the TCP buffer because it is blocked waiting for the drain to happen because of the low default limit of 16KiB. By increasing the limit, we instead rely on the underlying TCP buffer and stack to deliver the messages which can typically happen much faster.

After the auth phase is completed, and we are not concerned about the user being a malicious client, we set the limit to force a drain to 1MiB. 1MiB is the maximum expected size of the serialized entity registry (for display), which is the largest message we usually send.

https://github.com/aio-libs/aiohttp/commit/b3c80ee3f7d5d8f0b8bc27afe52e4d46621eaf99 added a way to set the limit, but there is no way to actually reach the code to set the limit, so we have to set it directly.

I started looking into this more because of https://github.com/alexarch21/history-explorer-card/issues/156#issuecomment-1475420034 since the results of the websocket having this level of latency where unexpected. In the future we plan on delivering binary data over it as well so any unexpected latency is of great interest in solving.

Type of change

  • [ ] Dependency upgrade
  • [x] Bugfix (non-breaking change which fixes an issue)
  • [ ] New integration (thank you!)
  • [ ] New feature (which adds functionality to an existing integration)
  • [ ] Deprecation (breaking change to happen in the future)
  • [ ] Breaking change (fix/feature causing existing functionality to break)
  • [ ] Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:

Checklist

  • [x] The code change is tested and works locally.
  • [ ] Local tests pass. Your PR cannot be merged unless tests pass
  • [ ] There is no commented out code in this PR.
  • [ ] I have followed the development checklist
  • [ ] I have followed the perfect PR recommendations
  • [ ] The code has been formatted using Black (black --fast homeassistant tests)
  • [ ] Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • [ ] The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • [ ] New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • [ ] For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • [ ] Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

bdraco avatar Mar 18 '23 04:03 bdraco

Hey there @home-assistant/core, mind taking a look at this pull request as it has been labeled with an integration (websocket_api) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of websocket_api can trigger bot actions by commenting:

  • @home-assistant close Closes the pull request.
  • @home-assistant rename Awesome new title Renames the pull request.
  • @home-assistant reopen Reopen the pull request.
  • @home-assistant unassign websocket_api Removes the current integration label and assignees on the pull request, add the integration domain after the command.

home-assistant[bot] avatar Mar 18 '23 04:03 home-assistant[bot]

Did more redesign work for this and made aiohttp subclasses to avoid the protected attribute. It grew to over 100 lines so I abandoned that approach. Setting the protected attribute is the least maintenance burden here since it's a one line change.

bdraco avatar Mar 20 '23 16:03 bdraco

Need to update the comments here as I understand this a lot better now from the above goose chase

bdraco avatar Mar 20 '23 16:03 bdraco

With the current design we have to ways of managing back pressure:

  • Aiohttp built-in 16k drain helper
  • the maximum queue size

We keep increasing the queue size hoping to stay under the disconnect limit but since there is the second back pressure management going on in aiohttp it's not effective and the one there becomes a bottleneck for our back pressure management strategy.

We should raise the aiohttp limit even more since we have our own back pressure management strategy and having two of them ends up causing unneeded disconnects which reconnect right away and use orders of magnitude more processing time than what the originally added back pressure in aiohttp was trying to prevent.

bdraco avatar Mar 20 '23 16:03 bdraco

When testing this I found a few places on the frontend we still deliver the entire entity registry ~ 2.4MiB. It only happens on one of my instances though so I left the limit at 1MiB. I'm going to do a deep dive later today on the frontend and see if we can improve that.

bdraco avatar Mar 20 '23 17:03 bdraco

Found the extra entity registry load. Its coming from lovelace-card-tools: opened https://github.com/thomasloven/lovelace-card-tools/pull/71

bdraco avatar Mar 20 '23 21:03 bdraco

Dug through the aiohttp code again to see if there is another way. On the balance I think this is the best solution right now since anything else has more maintenance burden.

bdraco avatar Mar 21 '23 07:03 bdraco