
Purge causes recorder to stop writing to the DB.

peterle00370 opened this issue

The problem

Since the last database migration focused on PostgreSQL, I see almost every day that the recorder stops recording data. Restarting HA brings recording back, but the data in between is lost. The symptoms are the same as in issue #117263, but in my opinion the cause is different. I'm using a PostgreSQL database on another machine, so access to the database itself is easy for me to handle.

According to the SQLAlchemy debug log, the daily database cleanup begins with a SELECT on the "states" table (limited to 4,000 records), followed by an UPDATE of "old_state_id", followed in turn by a DELETE. In the example (see LOG) this sequence repeats 20 times in about 30 seconds. It is then followed by a "SELECT DISTINCT states.attributes_id FROM states WHERE states.attributes_id IN (.....)". This statement is useless in my opinion: the "attributes_id" values listed in the IN part of the statement are already unique, so the result of the SELECT DISTINCT corresponds to the values the query was given. The full index scan caused by this SELECT places a massive load on the database without any apparent purpose and blocks the recording.

As expected, aborting the SELECT statement on the server side leads to a ROLLBACK of the entire purge. However, the data buffered in the meantime is then cleanly written to the database and a restart of HA is NOT necessary; the database purge itself is simply not done. If the SELECT DISTINCT is not canceled, memory consumption on HA rises after a couple of hours, and the only way to get a functioning system again is a reboot of the complete HA system, of course also with a ROLLBACK because of the lost connection.
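To make the semantics of that statement concrete: it is effectively an existence check, answering "which of these candidate attribute ids are still referenced by at least one state row?". The sketch below reproduces this on a throwaway SQLite database; the table name follows the recorder schema, all data is made up.

```python
import sqlite3

# Toy stand-in for the recorder's "states" table: many rows can share
# one attributes_id, so the column is far from unique in the table itself.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE states (state_id INTEGER PRIMARY KEY, attributes_id INTEGER)")
con.executemany(
    "INSERT INTO states (attributes_id) VALUES (?)",
    [(21,), (21,), (23,), (23,), (23,), (58,)],
)

# The IN-list itself is already unique; DISTINCT only collapses the
# duplicates coming from the many states rows that share an id.
candidates = (21, 23, 58, 64)
rows = con.execute(
    "SELECT DISTINCT attributes_id FROM states WHERE attributes_id IN (?, ?, ?, ?)",
    candidates,
).fetchall()
still_referenced = sorted(r[0] for r in rows)
print(still_referenced)  # [21, 23, 58] -- 64 is unreferenced and can be purged
```

So the query does have a purpose (filtering out ids that are still in use), but as the EXPLAIN output below shows, PostgreSQL answers it by scanning every matching index entry rather than stopping at the first hit per id.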

[EDIT 2024/09/19] After further searching and many hours later, I found the check for "database_engine.optimizer.slow_range_in_select" in the "purge.py" source, along with the comment about MariaDB's poor optimizer. I thought "it can't get any worse", so I looked for the place where it is set, which led me to "util.py". There I added (in my case at line 601) the statement "slow_range_in_select = True" and started a purge. I THINK the purge takes longer, but it works!
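For readers unfamiliar with that flag: it is a boolean on the engine description that switches the purge between two batching styles. The sketch below is a simplified, hypothetical stand-in (the `Optimizer` class and `purge_batch_strategy` helper are my own illustration, not recorder code) for what forcing it on changes:

```python
from dataclasses import dataclass

@dataclass
class Optimizer:
    # Mirrors the spirit of the recorder's slow_range_in_select switch:
    # True means "this engine handles large range selects poorly, so the
    # purge should select candidate ids first and delete by explicit
    # id lists instead".
    slow_range_in_select: bool = False

def purge_batch_strategy(optimizer: Optimizer) -> str:
    """Hypothetical helper: name the purge style the flag implies."""
    if optimizer.slow_range_in_select:
        return "select-ids-then-delete-in-list"
    return "range-select-and-delete"

# The one-line workaround described above amounts to forcing the flag on:
opt = Optimizer(slow_range_in_select=True)
print(purge_batch_strategy(opt))  # select-ids-then-delete-in-list
```

The id-list path does more round trips, which matches the observation that the purge "takes longer, but it works".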

What version of Home Assistant Core has the issue?

core-2024.9.1

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

recorder

Link to integration documentation on our website

Recorder

Diagnostics information

home-assistant.log

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

EXPLAIN of a SHORT form of the "SELECT DISTINCT" statement:

 Unique  (cost=0.57..3452037.82 rows=2198 width=8)
   ->  Index Only Scan using ix_states_attributes_id on states  (cost=0.57..3448204.77 rows=1533221 width=8)
         Index Cond: (attributes_id = ANY ('{4734983,21,4816918,23,4751383,9245058,22,24,9378572,9378573,9378574,4800548,9378578,9378579,58,64,4735040,66,67,68,7787168,7822000,6619208,74,7822001,76,77,4767820,9378583,7822002,9378584,83,85,86,92,7787169,94,6619235,9378589,9378590,112,114,115,9378591,7787170,4735095,6783100,4735104,4866181,4735114,7864475,7864476,7864477,6783139,9378604,9378606,9378607,6799568,4899078,9339143,4822851,4771823,7536906,4800789,4735253,4735266,4768038,8257853,9355597,4817229,7995733,7995734,7995735,7639693,345,349,350,4850055,8028605,8028606,8028607,7520730,9378664,7553530,4735485,7979522,7979523,7979524,9378671,9378678,9378679,4804398,3981877,9378680,4735544,9378681,9378682,9378685,6701714,7717528,7717529,7717530,8045216,4754547,9241256,4850348,4768434,9378705,6636221,7848643,9375452,7717573,7717574,7717575,4866756,9378709,773397}'::bigint[]))
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true

peterle00370 avatar Sep 16 '24 20:09 peterle00370

Hey there @home-assistant/core, mind taking a look at this issue as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!


(message by CodeOwnersMention)


recorder documentation · recorder source (message by IssueLinks)

home-assistant[bot] avatar Sep 17 '24 06:09 home-assistant[bot]

I didn't really see it written, but is it the same issue when the recorder hangs during autopurge some time after 4 AM?

sezlony avatar Sep 30 '24 12:09 sezlony

I didn't really see it written, but is it the same issue when the recorder hangs during autopurge some time after 4 AM?

As I understand the "4 AM hang", YES. But since I have PostgreSQL running on another machine, and there are some midnight jobs there, the HA purge starts at 00:12 (which counts as such FOR ME). Around 3.5-4 hours after it starts, there is a sharp increase in memory usage on HA, and shortly afterwards the "recorder hang". After this point, only a restart of the HA system has ever helped to get recording working again. On the SQL server, the load average rises to 5-7 during this period. The "SQL server" is a Raspberry Pi 4B with 8 GB; it also acts as a small NAS, DNS for the LAN, and a bit of a system playground for me, but it handles all of that VERY well. Load average is usually <= 0.5.

Best regards, Peter

peterle00370 avatar Sep 30 '24 17:09 peterle00370

The SELECT DISTINCT optimizes poorly on PostgreSQL because it doesn't support loose/skip index scans: https://wiki.postgresql.org/wiki/Loose_indexscan
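The wiki page linked above shows how a loose index scan can be emulated with a recursive CTE: each step jumps straight to the next distinct value through the index instead of visiting every row. The same trick can be demonstrated on SQLite (which also lacks skip scans); the table name mirrors the recorder schema, the data is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE states (state_id INTEGER PRIMARY KEY, attributes_id INTEGER)")
con.execute("CREATE INDEX ix_states_attributes_id ON states (attributes_id)")
con.executemany(
    "INSERT INTO states (attributes_id) VALUES (?)",
    [(v,) for v in [1, 1, 1, 2, 2, 5, 5, 5, 5, 9]],
)

# Emulated loose index scan: each recursive step fetches the minimum
# value strictly greater than the previous one, so the index is probed
# once per distinct value instead of once per row.
distinct_ids = [r[0] for r in con.execute("""
    WITH RECURSIVE t(val) AS (
        SELECT MIN(attributes_id) FROM states
        UNION ALL
        SELECT (SELECT MIN(attributes_id) FROM states WHERE attributes_id > t.val)
        FROM t WHERE t.val IS NOT NULL
    )
    SELECT val FROM t WHERE val IS NOT NULL
""")]
print(distinct_ids)  # [1, 2, 5, 9]
```

This is essentially what MariaDB/MySQL's optimizer does automatically for such DISTINCT queries, and why the same statement behaves so differently across engines.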

bdraco avatar Dec 21 '24 04:12 bdraco

select 
  attributes_id 
from 
  states 
where 
  attributes_id in (
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
    15, 16, 17, 18, 19, 23, 25, 26, 27, 28, 
    29, 30, 31, 32, 33, 34, 35, 36, 37, 40, 
    41, 42, 43, 44, 45, 47, 48, 49, 50, 51, 
    52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 
    62
  ) 
group by 
  attributes_id;

That might be a bit faster

bdraco avatar Dec 21 '24 04:12 bdraco

doesn't really help

# explain select 
  attributes_id 
from 
  states 
where 
  attributes_id in (
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
    15, 16, 17, 18, 19, 23, 25, 26, 27, 28, 
    29, 30, 31, 32, 33, 34, 35, 36, 37, 40, 
    41, 42, 43, 44, 45, 47, 48, 49, 50, 51, 
    52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 
    62
  ) 
group by 
  attributes_id;
                                                                                                      QUERY PLAN                                                                                                       
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=0.57..33703.02 rows=2616 width=8)
   Group Key: attributes_id
   ->  Index Only Scan using ix_states_attributes_id on states  (cost=0.57..30253.55 rows=1379788 width=8)
         Index Cond: (attributes_id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,23,25,26,27,28,29,30,31,32,33,34,35,36,37,40,41,42,43,44,45,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62}'::bigint[]))
(4 rows)

# explain SELECT 
  DISTINCT states.attributes_id 
FROM 
  states 
WHERE 
  states.attributes_id IN (
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
    15, 16, 17, 18, 19, 23, 25, 26, 27, 28, 
    29, 30, 31, 32, 33, 34, 35, 36, 37, 40, 
    41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 
    51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 
    61, 62
  );
                                                                                                        QUERY PLAN                                                                                                        
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.57..34314.72 rows=2616 width=8)
   ->  Index Only Scan using ix_states_attributes_id on states  (cost=0.57..30802.53 rows=1404875 width=8)
         Index Cond: (attributes_id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,23,25,26,27,28,29,30,31,32,33,34,35,36,37,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62}'::bigint[]))
(3 rows)

bdraco avatar Dec 21 '24 04:12 bdraco

# explain SELECT 
  inner_state_attributes.attributes_id AS attributes_id 
FROM 
  state_attributes 
  JOIN LATERAL (
    SELECT 
      states.attributes_id as attributes_id 
    FROM 
      states 
    WHERE 
      states.attributes_id = state_attributes.attributes_id 
    LIMIT 
      1
  ) AS inner_state_attributes ON true
WHERE 
  state_attributes.attributes_id IN (
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
    15, 16, 17, 18, 19, 23, 25, 26, 27, 28, 
    29, 30, 31, 32, 33, 34, 35, 36, 37, 40, 
    41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 
    51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 
    61, 62
  );
                                                                                                        QUERY PLAN                                                                                                        
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.01..287.75 rows=56 width=8)
   ->  Index Only Scan using state_attributes_pkey on state_attributes  (cost=0.43..253.34 rows=56 width=8)
         Index Cond: (attributes_id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,23,25,26,27,28,29,30,31,32,33,34,35,36,37,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62}'::bigint[]))
   ->  Limit  (cost=0.57..0.59 rows=1 width=8)
         ->  Index Only Scan using ix_states_attributes_id on states  (cost=0.57..2964.22 rows=135045 width=8)
               Index Cond: (attributes_id = state_attributes.attributes_id)
(6 rows)

If we re-write it as a lateral join, it's almost an order of magnitude faster.
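The reason the lateral plan wins is its `LIMIT 1`: per candidate id, the inner scan stops at the first matching state row instead of enumerating all of them. Semantically this is an existence probe, which can be expressed portably with EXISTS; a sketch on SQLite (which has no LATERAL), with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE state_attributes (attributes_id INTEGER PRIMARY KEY);
    CREATE TABLE states (state_id INTEGER PRIMARY KEY, attributes_id INTEGER);
    CREATE INDEX ix_states_attributes_id ON states (attributes_id);
""")
con.executemany("INSERT INTO state_attributes VALUES (?)", [(i,) for i in (1, 2, 3, 4)])
# Only ids 1 and 3 are still referenced from states; 2 and 4 are orphaned.
con.executemany("INSERT INTO states (attributes_id) VALUES (?)", [(1,), (1,), (3,)])

# EXISTS expresses the same "stop after the first matching state row per
# id" idea that the lateral subquery's LIMIT 1 encodes.
still_used = sorted(r[0] for r in con.execute("""
    SELECT attributes_id FROM state_attributes
    WHERE attributes_id IN (1, 2, 3, 4)
      AND EXISTS (SELECT 1 FROM states
                  WHERE states.attributes_id = state_attributes.attributes_id)
"""))
print(still_used)  # [1, 3]
```

Whether PostgreSQL plans the EXISTS form as cheaply as the explicit lateral join would need its own EXPLAIN; the point here is only that both are per-id existence checks rather than full enumerations.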

bdraco avatar Dec 21 '24 04:12 bdraco

It probably makes more sense to set slow_range_in_select for PostgreSQL because it will be nearly as fast and we don't have to support another code branch

bdraco avatar Dec 21 '24 04:12 bdraco

It probably makes more sense to set slow_range_in_select for PostgreSQL because it will be nearly as fast and we don't have to support another code branch

I think, given your explanation and the relatively high effort of maintaining an additional code branch, this is probably by far the best solution!

peterle00370 avatar Dec 21 '24 08:12 peterle00370

I ended up optimizing this a bit more in https://github.com/home-assistant/core/pull/133752

bdraco avatar Dec 21 '24 22:12 bdraco