
Fix issues with statistics caused by race conditions

Open unfug-at-github opened this issue 1 year ago • 1 comments

Breaking change

Proposed change

The statistics sensor uses an internal list to keep the history of previous states. This list is accessed asynchronously by several functions (new sensor values arrive, sensor values become outdated, and historical data loaded from the database is added to the list). There is no synchronization to prevent parallel access to the list, which can lead to all kinds of issues caused by race conditions. The proposed change synchronizes access to the list so that it cannot be modified while statistics are being computed. Among other things, this fixes the incorrect computations (spikes) in the statistical values, which are caused by new sensor values arriving before historic values have been loaded from the database.
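A minimal sketch of the ordering problem described above, using plain asyncio rather than the integration's real code. The function names, the `(timestamp, value)` tuples, and the simulated delay are all assumptions made for illustration; the sketch only shows how a value that arrives before the history load finishes can end up ahead of older samples in the buffer.

```python
import asyncio


async def load_history(buffer: list[tuple[int, float]]) -> None:
    """Simulate a slow database read, then add the old states to the buffer."""
    await asyncio.sleep(0.01)  # yields to the event loop, like a real query would
    buffer.extend([(1, 20.0), (2, 20.5)])  # (timestamp, value), oldest first


async def racy_startup() -> list[tuple[int, float]]:
    buffer: list[tuple[int, float]] = []
    # Fire-and-forget: the history load is started but not awaited.
    task = asyncio.create_task(load_history(buffer))
    # A live state change arrives while the load is still pending.
    buffer.append((3, 21.0))
    await task  # only so that the demo can inspect the final buffer
    return buffer


result = asyncio.run(racy_startup())
print(result)  # the newest sample (timestamp 3) sits in front of the older ones
```

Any statistic that assumes the buffer is ordered by time would be computed on a misordered history in this situation.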

Type of change

  • [ ] Dependency upgrade
  • [x] Bugfix (non-breaking change which fixes an issue)
  • [ ] New integration (thank you!)
  • [ ] New feature (which adds functionality to an existing integration)
  • [ ] Deprecation (breaking change to happen in the future)
  • [ ] Breaking change (fix/feature causing existing functionality to break)
  • [ ] Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #119738 #98262 #67627
  • This PR is related to issue:
  • Link to documentation pull request:

Checklist

  • [x] The code change is tested and works locally.
  • [x] Local tests pass. Your PR cannot be merged unless tests pass
  • [x] There is no commented out code in this PR.
  • [x] I have followed the development checklist
  • [x] I have followed the perfect PR recommendations
  • [x] The code has been formatted using Ruff (ruff format homeassistant tests)
  • [x] Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • [ ] The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • [ ] New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • [ ] For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.

To help with the load of incoming pull requests:

unfug-at-github avatar Oct 19 '24 18:10 unfug-at-github

Hey there @thomdietrich, mind taking a look at this pull request as it has been labeled with an integration (statistics) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of statistics can trigger bot actions by commenting:

  • @home-assistant close Closes the pull request.
  • @home-assistant rename Awesome new title Renames the pull request.
  • @home-assistant reopen Reopen the pull request.
  • @home-assistant unassign statistics Removes the current integration label and assignees on the pull request, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.

home-assistant[bot] avatar Oct 19 '24 18:10 home-assistant[bot]

@unfug-at-github Can you please point to the races fixed by your PR?

Looking at the statistics sensor, I see two issues:

  • The async_added_to_hass coroutine function creates a task which starts listening to state changes from the input sensor and reads old state change events from the recorder. There's a lack of synchronization to ensure that this task has finished before the statistics sensor starts handling state change events from the input sensor. Is that the main issue your PR is meant to fix? Can't it be fixed by awaiting the loading of events from the recorder, instead of all the changes in the PR?
  • The async_update coroutine function seems to be safe since it never yields by awaiting anything

emontnemery avatar Oct 21 '24 17:10 emontnemery

An approach like this removes the task, without requiring all the other changes:

diff --git a/homeassistant/components/statistics/config_flow.py b/homeassistant/components/statistics/config_flow.py
index 145a7655b36..4280c92131a 100644
--- a/homeassistant/components/statistics/config_flow.py
+++ b/homeassistant/components/statistics/config_flow.py
@@ -169,8 +169,8 @@ class StatisticsConfigFlowHandler(SchemaConfigFlowHandler, domain=DOMAIN):
         vol.Required("user_input"): dict,
     }
 )
-@callback
-def ws_start_preview(
+@websocket_api.async_response
+async def ws_start_preview(
     hass: HomeAssistant,
     connection: websocket_api.ActiveConnection,
     msg: dict[str, Any],
@@ -234,6 +234,6 @@ def ws_start_preview(
     preview_entity.hass = hass

     connection.send_result(msg["id"])
-    connection.subscriptions[msg["id"]] = preview_entity.async_start_preview(
+    connection.subscriptions[msg["id"]] = await preview_entity.async_start_preview(
         async_preview_updated
     )
diff --git a/homeassistant/components/statistics/sensor.py b/homeassistant/components/statistics/sensor.py
index ba98fe3ec6e..7d7ba8e4fd7 100644
--- a/homeassistant/components/statistics/sensor.py
+++ b/homeassistant/components/statistics/sensor.py
@@ -373,8 +373,7 @@ class StatisticsSensor(SensorEntity):
         self._update_listener: CALLBACK_TYPE | None = None
         self._preview_callback: Callable[[str, Mapping[str, Any]], None] | None = None

-    @callback
-    def async_start_preview(
+    async def async_start_preview(
         self,
         preview_callback: Callable[[str, Mapping[str, Any]], None],
     ) -> CALLBACK_TYPE:
@@ -392,7 +391,7 @@ class StatisticsSensor(SensorEntity):

         self._preview_callback = preview_callback

-        self._async_stats_sensor_startup(self.hass)
+        await self._async_stats_sensor_startup()
         return self._call_on_remove_callbacks

     @callback
@@ -413,8 +412,7 @@ class StatisticsSensor(SensorEntity):
         if not self._preview_callback:
             self.async_write_ha_state()

-    @callback
-    def _async_stats_sensor_startup(self, _: HomeAssistant) -> None:
+    async def _async_stats_sensor_startup(self) -> None:
         """Add listener and get recorded state."""
         _LOGGER.debug("Startup for %s", self.entity_id)
         self.async_on_remove(
@@ -425,13 +423,11 @@ class StatisticsSensor(SensorEntity):
             )
         )
         if "recorder" in self.hass.config.components:
-            self.hass.async_create_task(self._initialize_from_database())
+            await self._initialize_from_database()

     async def async_added_to_hass(self) -> None:
         """Register callbacks."""
-        self.async_on_remove(
-            async_at_start(self.hass, self._async_stats_sensor_startup)
-        )
+        await self._async_stats_sensor_startup()

     def _add_state_to_queue(self, new_state: State) -> None:
         """Add the state to the queue."""

emontnemery avatar Oct 21 '24 17:10 emontnemery

@unfug-at-github Can you please point to the races fixed by your PR?

Looking at the statistics sensor, I see two issues:

  • The async_added_to_hass coroutine function creates a task which starts listening to state changes from the input sensor and reads old state change events from the recorder. There's a lack of synchronization to ensure that this task has finished before the statistics sensor starts handling state change events from the input sensor. Is that the main issue your PR is meant to fix? Can't it be fixed by awaiting the loading of events from the recorder, instead of all the changes in the PR?
  • The async_update coroutine function seems to be safe since it never yields by awaiting anything

You are right, regarding the main issue.

There are always many ways to solve a problem. I have to admit that this is my first larger contact with Python, and I am coming from a Java / C / C# / C++ background. I have only started to understand how concurrency is handled in Python, and I only looked at this module and didn't go through the whole core code to understand how Home Assistant is handling its threads. This may have led to a solution that is a little "safer" than required regarding concurrency. The proposed solution would be safe even if parallel threads were doing the updates, which I understand is not the case.

Nevertheless, I think moving the buffer functions into a separate class makes sense. The statistics module has grown pretty large and is cluttered with functions solving different issues. I guess following Martin Fowler's advice of "one class - one functionality" makes the whole thing easier to understand.

unfug-at-github avatar Oct 22 '24 06:10 unfug-at-github

@unfug-at-github We won't accept a solution which changes more than 600 lines of code to fix a race condition if the problem can be solved by just changing ~25 lines of code unless there's a very good reason to do so.

Maybe splitting out the state buffer to a separate class is motivated, but I don't think it's needed to solve the race condition.

Some background:

Home Assistant uses the Python asyncio framework for all its internals. The framework is essentially single-threaded, with a single event loop thread, although jobs can optionally be handed off to so-called "executor threads". Unless jobs run by executors perform non-thread-safe operations on objects owned by the event loop thread, a lack of thread safety causes no issues. Awaitable tasks, also called coroutines, may however yield when they await another task. This can cause races, and I believe this is causing the problem seen at startup of the statistics sensor.

Because Home Assistant has chosen asyncio, we basically never need to worry about thread safety; we only need to worry about races introduced by yielding coroutines.
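To illustrate the point about yielding coroutines, here is a small standalone sketch (not Home Assistant code; all names are made up for the example). Two tasks append to a shared list: when the coroutine never awaits, each task runs to completion before the other starts, and when it awaits between steps, the event loop is free to switch between them.

```python
import asyncio

log: list[str] = []


async def no_yield(name: str) -> None:
    """Never awaits, so it runs to completion once it has started."""
    for step in range(3):
        log.append(f"{name}{step}")


async def yields(name: str) -> None:
    """Awaits between steps, so other tasks may run in between."""
    for step in range(3):
        log.append(f"{name}{step}")
        await asyncio.sleep(0)  # hand control back to the event loop


async def main() -> tuple[list[str], list[str]]:
    log.clear()
    await asyncio.gather(no_yield("a"), no_yield("b"))
    atomic = list(log)
    log.clear()
    await asyncio.gather(yields("a"), yields("b"))
    interleaved = list(log)
    return atomic, interleaved


atomic, interleaved = asyncio.run(main())
print(atomic)       # all "a" entries come before any "b" entry
print(interleaved)  # entries from "a" and "b" are mixed
```

Only the second variant needs any thought about what other code may have changed the shared list between two steps.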

This may seem like a very poor way to do concurrency in 2024, and it is. The crux is that CPython itself doesn't support multiple threads executing Python code in parallel because of the so-called GIL, meaning performance with asyncio's single event loop thread is not worse than if Home Assistant were written multi-threaded. The approach to concurrency in Python may change in favor of multi-threading in a few years if https://peps.python.org/pep-0703/ is well received by the Python ecosystem.

We can discuss more on discord if you want, I'm @emontnemery there too.

emontnemery avatar Oct 22 '24 06:10 emontnemery

@emontnemery: What about the following proposal: I remove all changes in the actual statistics functions that add significantly to the number of changed lines. This will make the code less thread-safe, but as you explained, that is not an issue because of the way Home Assistant and Python handle things.

Of course, you could make this fix even smaller by integrating the changes into the existing functions. However, I would prefer to keep the buffer functionality in a separate file. This makes the code more readable and easier to understand. In my opinion, refactoring is needed from time to time. If you keep making every change in the smallest possible way, you will end up with code that becomes extremely cluttered and impossible to maintain after a while.

unfug-at-github avatar Oct 22 '24 09:10 unfug-at-github

@unfug-at-github Please look into my proposed solution first. The intention with it is to remove the race introduced by starting off, and not waiting for, the task which reads from the recorder. Removing an unnecessary race must always be preferred over writing code which handles a race.
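As a rough illustration of removing the race rather than handling it, the sketch below awaits the history load during startup instead of starting it as a background task. `BufferedSensor` and its methods are hypothetical stand-ins, not the statistics integration's classes, and the sleep merely simulates the recorder query.

```python
import asyncio


class BufferedSensor:
    """Hypothetical sensor that keeps timestamps of the states it has seen."""

    def __init__(self) -> None:
        self.buffer: list[int] = []

    async def _load_history(self) -> None:
        await asyncio.sleep(0.01)  # stands in for the recorder query
        self.buffer.extend([1, 2])  # older states, oldest first

    async def startup(self) -> None:
        # Awaited directly instead of being wrapped in a background task.
        await self._load_history()

    def handle_state_change(self, timestamp: int) -> None:
        self.buffer.append(timestamp)


async def main() -> list[int]:
    sensor = BufferedSensor()
    await sensor.startup()  # returns only after the history is in the buffer
    sensor.handle_state_change(3)  # live events are handled afterwards
    return sensor.buffer


ordered = asyncio.run(main())
print(ordered)
```

Because nothing else touches the buffer until `startup()` has returned, no locking or reordering logic is needed in this sketch.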

Refactoring to a different abstraction for managing the state buffer may be a good idea, but I think that's for a separate PR?

emontnemery avatar Oct 22 '24 09:10 emontnemery

@emontnemery: I have created a new PR based on your approach (#129066) including an adapted test of this solution. Closing this one.

unfug-at-github avatar Oct 24 '24 07:10 unfug-at-github