architecture
architecture copied to clipboard
Create separate ThreadPoolExecutor for core system usage
Problem
We use one ThreadPoolExecutor to arrange all sync version functions, such as Entity.update() and util.json.load_json(). The size of this pool is automatically determined.
Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5 ref1
In some case, such as home-assistant/home-assistant#21925, the whole thread pool could be blocked by bad integration or upstream lib, then it will cause the whole system unusable.
Proposal
Therefore, I suggest we create a separate ThreadPoolExecutor, named core_executor, only execute the core functions, such as util.json.load_json() used in storage helper. We can have two options to do that
-
add executor parameter in
HomeAssistant.async_add_executor_jobparameter -
add an annotation on the function we want to only be executed in core_executor
Alternative
Allow user change the max_works via core configuration or environment variable.
Rewrite async friendly util.json.load_json() and other core still-sync functions.
So besides a bad integration taking the system down, how common is this? As far as I know, this is really the first occurrence of this problem (that we know). Maybe we should keep an eye on it before trying to solve it?
btw, can't read files inside asyncio, that's executor only. One thing that might help, and I'm considering, is doing all the registry and auth reads before setting up a single component.
So bad integrations taking the system down definitely happen. I've managed to do this to my testing environment more than once while working on a component and accidentally doing I/O in a poor manner. Hopefully, they happen more in devs testing environments than production, but things can definitely slip through.
We had a case with tellduslive where a lock in the library both ate threads and held up the event loop.
Downside of using a separate thread pool is that problems will be harder to notice. The user will notice quite fast if there's a serious problem with the thread pool in home assistant and report it. With a separate thread pool, problems will be harder to diagnose.
Maybe have a core executor just during startup? There are very few writes after that.
Reads are also done trough the SyncWorkers, and such a read is done when the Home Assistant front page is opened (for the translations).
My system has gone from stable to almost unusable in the last few releases.
I strongly suspect it is related to the number of workers. A 433mhz button that fires an automation used to work within a couple of seconds. I can now sit there waiting for up to 30 seconds to a minute for it to work. Whilst in that state if I go to the front end it won't load either. I assume it is also waiting on the loop to be available?
I'm sure it will be a bad integration but there is no quick and easy way to find out which one.
So I think this is a good idea.
This architecture issue is old, stale, and possibly obsolete. Things changed a lot over the years. Additionally, we have been moving to discussions for these architectural discussions.
For that reason, I'm going to close this issue.
../Frenck