Purge causes recorder to stop writing to the DB.
The problem
Since the last database migration with a focus on PostgreSQL, the recorder stops recording data almost every day. Restarting HA brings recording back, but the data in between is lost. The symptoms are the same as in issue #117263, but in my opinion the underlying cause is different. I'm using a PostgreSQL database on another machine, so access to the database itself is easy for me.
According to the SQLAlchemy debug log, the daily database cleanup begins with a SELECT on the "states" table (limited to 4000 records), followed by an UPDATE of "old_state_id" and then a DELETE. In the example (see LOG) this sequence repeats 20 times in about 30 seconds. It is then followed by a "SELECT DISTINCT states.attributes_id FROM states WHERE states.attributes_id IN (.....)". This statement is useless in my opinion: the "attributes_id" values listed in the IN clause are already unique, so the result of the SELECT DISTINCT simply corresponds to the values passed into the query. The full index scan caused by this SELECT places a massive load on the database for no apparent purpose and blocks the recording.

As expected, aborting the SELECT statement on the server side leads to a ROLLBACK of the entire purge. However, the data buffered in the meantime is then cleanly written to the database, and a restart of HA is NOT necessary (the purge itself is, of course, NOT done). If the SELECT DISTINCT is not canceled, memory consumption on HA starts rising after a couple of hours, and the only way to get a functioning system again is a reboot of the complete HA system; this of course also ends in a ROLLBACK because of the lost connection.
[EDIT 2024/09/19] After further searching and many hours later, I found the check for "database_engine.optimizer.slow_range_in_select" in the "purge.py" source, along with the comment about MariaDB's poor optimizer. I thought "it can't get any worse", so I looked for the place where the flag is set, which led me to "util.py". There I added (in my case in line 601) the statement "slow_range_in_select = True" and started a purge. I THINK the purge takes longer, but it works!
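To illustrate the idea behind the workaround: a capability flag on the database engine's optimizer description can be used to switch the purge between query strategies. The sketch below is a hypothetical, heavily simplified stand-in (the names `Optimizer` and `choose_purge_strategy` are illustrative, not Home Assistant's actual code), assuming the flag simply selects a fallback path when large range/IN selects are known to optimize poorly:

```python
from dataclasses import dataclass


@dataclass
class Optimizer:
    """Hypothetical per-engine capability description."""
    slow_range_in_select: bool = False


def choose_purge_strategy(optimizer: Optimizer) -> str:
    # When large IN (...) / range selects are known to optimize poorly on
    # this engine, fall back to a batched strategy instead of issuing one
    # big SELECT DISTINCT over the whole id range.
    if optimizer.slow_range_in_select:
        return "batched"
    return "single-select"


# Forcing the flag on (as the edit in util.py does) picks the slower but
# safer batched path.
print(choose_purge_strategy(Optimizer(slow_range_in_select=True)))   # batched
print(choose_purge_strategy(Optimizer(slow_range_in_select=False)))  # single-select
```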
What version of Home Assistant Core has the issue?
core-2024.9.1
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
recorder
Link to integration documentation on our website
Diagnostics information
Example YAML snippet
No response
Anything in the logs that might be useful for us?
No response
Additional information
EXPLAIN of a SHORT form of the "SELECT DISTINCT"-statement:
Unique  (cost=0.57..3452037.82 rows=2198 width=8)
  ->  Index Only Scan using ix_states_attributes_id on states  (cost=0.57..3448204.77 rows=1533221 width=8)
        Index Cond: (attributes_id = ANY ('{4734983,21,4816918,23,4751383,9245058,22,24,9378572,9378573,9378574,4800548,9378578,9378579,58,64,4735040,66,67,68,7787168,7822000,6619208,74,7822001,76,77,4767820,9378583,7822002,9378584,83,85,86,92,7787169,94,6619235,9378589,9378590,112,114,115,9378591,7787170,4735095,6783100,4735104,4866181,4735114,7864475,7864476,7864477,6783139,9378604,9378606,9378607,6799568,4899078,9339143,4822851,4771823,7536906,4800789,4735253,4735266,4768038,8257853,9355597,4817229,7995733,7995734,7995735,7639693,345,349,350,4850055,8028605,8028606,8028607,7520730,9378664,7553530,4735485,7979522,7979523,7979524,9378671,9378678,9378679,4804398,3981877,9378680,4735544,9378681,9378682,9378685,6701714,7717528,7717529,7717530,8045216,4754547,9241256,4850348,4768434,9378705,6636221,7848643,9375452,7717573,7717574,7717575,4866756,9378709,773397}'::bigint[]))
JIT:
  Functions: 2
  Options: Inlining true, Optimization true, Expressions true, Deforming true
I didn't really see it written, but is this the same as when the recorder hangs during the auto-purge at around 4 AM?
As I understand the "4 am hang": YES. But since I have PostgreSQL running on another machine that has some midnight jobs of its own, my HA purge starts at 00:12 (which counts as the same thing FOR ME). About 3.5-4 hours after it starts, there is a sharp increase in memory usage on HA, and shortly afterwards the recorder hangs. After this point, only a restart of the HA system has ever helped to get recording going again. On the SQL server, the load average rises to 5-7 during this period. The "SQL server" is a Raspberry Pi 4B with 8 GB; it also serves as a small NAS, DNS for the LAN, and a bit of a system playground for me, but it handles all of this VERY well: the load average is usually <= 0.5.
Best regards, Peter
The SELECT DISTINCT optimizes poorly with PostgreSQL because it doesn't support loose/skip index scans: https://wiki.postgresql.org/wiki/Loose_indexscan
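For context, the technique described on that wiki page emulates a loose index scan with a recursive CTE: each step jumps to the next distinct value via the index instead of reading every matching row. A runnable sketch on SQLite with made-up data (the real query would run against the PostgreSQL "states" table):

```python
import sqlite3

# In-memory stand-in for the "states" table; column names match the issue,
# the data is invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE states (state_id INTEGER PRIMARY KEY, attributes_id INTEGER)")
con.execute("CREATE INDEX ix_states_attributes_id ON states (attributes_id)")
con.executemany(
    "INSERT INTO states (attributes_id) VALUES (?)",
    [(a,) for a in [7, 7, 7, 7, 3, 3, 9, 9, 9]],
)

# Loose-index-scan emulation: seek the minimum, then repeatedly seek the
# next value strictly greater than the previous one.
rows = con.execute("""
    WITH RECURSIVE t(a) AS (
        SELECT min(attributes_id) FROM states
        UNION ALL
        SELECT (SELECT min(attributes_id) FROM states WHERE attributes_id > t.a)
        FROM t WHERE t.a IS NOT NULL
    )
    SELECT a FROM t WHERE a IS NOT NULL
""").fetchall()
print(rows)  # [(3,), (7,), (9,)]
```

Each distinct value costs one index seek, instead of the full index scan over all 1.5 million matching rows seen in the EXPLAIN above.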
select
attributes_id
from
states
where
attributes_id in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 23, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 40,
41, 42, 43, 44, 45, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
62
)
group by
attributes_id;
That might be a bit faster
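The GROUP BY form is semantically equivalent to the SELECT DISTINCT: both return each matching attributes_id exactly once. A small check on SQLite with invented data (a stand-in for the real PostgreSQL table):

```python
import sqlite3

# Minimal in-memory "states" table with duplicate attributes_id values.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE states (state_id INTEGER PRIMARY KEY, attributes_id INTEGER)")
con.executemany(
    "INSERT INTO states (attributes_id) VALUES (?)",
    [(a,) for a in [1, 1, 2, 2, 2, 3, 5, 5]],
)

distinct = con.execute(
    "SELECT DISTINCT attributes_id FROM states WHERE attributes_id IN (1, 2, 3, 4, 5)"
).fetchall()
grouped = con.execute(
    "SELECT attributes_id FROM states WHERE attributes_id IN (1, 2, 3, 4, 5) "
    "GROUP BY attributes_id"
).fetchall()

# Both queries deduplicate to the same result set.
print(sorted(distinct) == sorted(grouped))  # True
```

Whether the planner produces a cheaper plan for one form or the other is engine-specific; the EXPLAIN outputs below answer that for PostgreSQL.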
doesn't really help
# explain select
attributes_id
from
states
where
attributes_id in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 23, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 40,
41, 42, 43, 44, 45, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
62
)
group by
attributes_id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Group (cost=0.57..33703.02 rows=2616 width=8)
Group Key: attributes_id
-> Index Only Scan using ix_states_attributes_id on states (cost=0.57..30253.55 rows=1379788 width=8)
Index Cond: (attributes_id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,23,25,26,27,28,29,30,31,32,33,34,35,36,37,40,41,42,43,44,45,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62}'::bigint[]))
(4 rows)
# explain SELECT
DISTINCT states.attributes_id
FROM
states
WHERE
states.attributes_id IN (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 23, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62
);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.57..34314.72 rows=2616 width=8)
-> Index Only Scan using ix_states_attributes_id on states (cost=0.57..30802.53 rows=1404875 width=8)
Index Cond: (attributes_id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,23,25,26,27,28,29,30,31,32,33,34,35,36,37,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62}'::bigint[]))
(3 rows)
# explain SELECT
inner_state_attributes.attributes_id AS attributes_id
FROM
state_attributes
JOIN LATERAL (
SELECT
states.attributes_id as attributes_id
FROM
states
WHERE
states.attributes_id = state_attributes.attributes_id
LIMIT
1
) AS inner_state_attributes ON true
WHERE
state_attributes.attributes_id IN (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 23, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62
);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.01..287.75 rows=56 width=8)
-> Index Only Scan using state_attributes_pkey on state_attributes (cost=0.43..253.34 rows=56 width=8)
Index Cond: (attributes_id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,23,25,26,27,28,29,30,31,32,33,34,35,36,37,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62}'::bigint[]))
-> Limit (cost=0.57..0.59 rows=1 width=8)
-> Index Only Scan using ix_states_attributes_id on states (cost=0.57..2964.22 rows=135045 width=8)
Index Cond: (attributes_id = state_attributes.attributes_id)
(6 rows)
If we re-write it as a lateral join, it's almost an order of magnitude faster.
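The lateral join wins because it probes each candidate id once (LIMIT 1) instead of scanning all matching rows. SQLite has no LATERAL, but the same semi-join shape can be expressed with EXISTS, which shows the intended result set on made-up data:

```python
import sqlite3

# Stand-ins for the real tables; data is invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE state_attributes (attributes_id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE states (state_id INTEGER PRIMARY KEY, attributes_id INTEGER)")
con.executemany("INSERT INTO state_attributes VALUES (?)", [(i,) for i in range(1, 7)])
con.executemany(
    "INSERT INTO states (attributes_id) VALUES (?)",
    [(a,) for a in [2, 2, 4, 4, 4, 5]],
)

# Of the candidate ids 1..6, keep only those still referenced by "states".
# The EXISTS probe stops at the first matching row, like the LIMIT 1 in
# the lateral subquery above.
rows = con.execute("""
    SELECT sa.attributes_id
    FROM state_attributes AS sa
    WHERE sa.attributes_id IN (1, 2, 3, 4, 5, 6)
      AND EXISTS (
          SELECT 1 FROM states AS s
          WHERE s.attributes_id = sa.attributes_id
      )
    ORDER BY sa.attributes_id
""").fetchall()
print(rows)  # [(2,), (4,), (5,)]
```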
It probably makes more sense to set slow_range_in_select for PostgreSQL, because it will be nearly as fast and we don't have to support another code branch.
Given your explanation and the relatively high effort of maintaining an additional code branch, I think this is by far the best solution!
I ended up optimizing this a bit more in https://github.com/home-assistant/core/pull/133752