core icon indicating copy to clipboard operation
core copied to clipboard

Database migration fails to complete timestamp conversions

Open werkstrom opened this issue 1 year ago • 42 comments

The problem

In the latest database migration, the database model changes, breaking functions (external to HA) that uses the database as a source. I've tried to find info on the change, but I cannot find it. Referring to #89069 and #90819 I thought maybe there was a change of the date format so I reported that on #88942 but was told it had nothing to do with statistics data. So, I'll be glad to contribute to finding this, but I need some help on finding out what actually changed in the DB migration to be able to help determine what is going on.

For context, I use the MariaDB instance on HA for controlling heating and generally power usage based on hourly rates (nordpool). Based on the day ahead prices favorable times for using f.ex. the water heater or the dehumidifyer is calculated and most unfavorable hours are calculated to reduce HVAC usage. In these calculations HA data is used, and this DB migration breaks those calculations. So the DB integrity is rather important.

I've reverted back to 22.2 again as to have my system up and running, but I'll gladly upgrade again to try to figure out what's going on and mitigate the issue. Just need some input on where to look.

What version of Home Assistant Core has the issue?

22.4.4

What was the last working version of Home Assistant Core?

2023.2.?

What type of installation are you running?

Home Assistant OS

Integration causing the issue

No response

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

werkstrom avatar Apr 16 '23 07:04 werkstrom

Hi, same problem for me, details here : https://community.home-assistant.io/t/upgrade-to-2023-4-4-with-mariadb-no-more-history-slow-queries-on-db-is-it-the-migration/560627

prigal avatar Apr 16 '23 08:04 prigal

@bdraco, could we have a look at this. After several tries and days of troubleshooting it turns out the issue here is indeed this migration to Epochs instead of proper dates. As stated above the migration breaks the database. More specifically the statistics table. The conversion is not complete leaving some records (85.380 out of 2.139.722 total records in my DB) with the old start value and NULL in start_ts, while the rest is converted and instead has NULL in start.

image

And there is continous locks. 12 since I upgraded 12 hours ago.

image

werkstrom avatar Apr 30 '23 19:04 werkstrom

Can you check what is causing the lock timeouts with innodb?

https://www.percona.com/blog/show-innodb-status-walk-through/

bdraco avatar Apr 30 '23 20:04 bdraco

Hey there @home-assistant/core, mind taking a look at this issue as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of recorder can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign recorder Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


recorder documentation recorder source (message by IssueLinks)

home-assistant[bot] avatar Apr 30 '23 20:04 home-assistant[bot]

Can you check what is causing the lock timeouts with innodb?

https://www.percona.com/blog/show-innodb-status-walk-through/

Am restoring from backup again. Need to have statistics running for a bit. Super important to have the data. But I'll check ASAP tomorrow (will try upgrading again if nothing is left in the logs on the DB server).

werkstrom avatar Apr 30 '23 21:04 werkstrom

Without the lock data I'd take a guess that its one or more of the following

  1. You have something that is accessing the database frequently that is tying it up
  2. The host system is slow or has a slow disk
  3. The system is generally busy / resource constrained already so the transaction cannot finish in a reasonable time

There are a few transactions that do 250k rows at a time (min throughput 5000/rows/s). We can reduce them to 100k to give your system a better chance of being able to finish the transaction before the deadline expires (min throughput 2000/row/s).

bdraco avatar Apr 30 '23 21:04 bdraco

Also what do you have the lock wait timeout set to? Assuming its 50s, your system would need to be able to do modify 5000/rows/s with 250k. The change in #92312 will reduce that requirement to 2000/rows/s.

MariaDB [(none)]> select @@innodb_lock_wait_timeout;
+----------------------------+
| @@innodb_lock_wait_timeout |
+----------------------------+
|                         50 |
+----------------------------+
1 row in set (0.000 sec)

At least on my test system I'm able to get it to do ~125000 rows per second.

bdraco avatar Apr 30 '23 21:04 bdraco

1: Nope. Quite sure I haven't anything that is that frequent. I do look up every 10 minutes or so (can check acctual frequency when I have the system up again). But that is not very often I would argue. 2: Samsung SSD, so no ;) 3: It's an RPi 4 running only HA (and addons) but it's really very low on both CPu and RAM usage. I'll screenshot Glances once the system is up again (or tomorrow... really sleepy).

I have this MariaDB set up default (I'm running my own stuff on MSSQL) from HA. So whatever everyone else has, so do I . Unless I've changed it by mistake.

werkstrom avatar Apr 30 '23 21:04 werkstrom

5000/rows/s when its low on ram and cpu is going to be a challenge for an RPi4 given the size of your data.

The MariaDB addon is not tuned (pretty much at all) by default for a RPI so its not going to be the best experience without doing some manual tuning

bdraco avatar Apr 30 '23 21:04 bdraco

I also tried copying the statistics table before resoring and when querying it un indexed for stats last 24 h it takes 10 sec. Ad index on start_ts 0.192, add desc index 0.100 but I don't know if the query gets cahed (shouldn't be after adding an index, but I don't know). This is the same table that takes 7+ sec in HA DB. On the same machine.

werkstrom avatar Apr 30 '23 21:04 werkstrom

Thats really slow. 24h of stats should take around a second at the most (unless you have a massive amount of statistics)

bdraco avatar Apr 30 '23 21:04 bdraco

Yeah 7 sec is bonkers. the 0.2 sec or even 0.1 sec I get on the backup with index is more like it. The 13 on an unindexed table of nearly 3.000.000 rows is fine. Since it needs a full table scan on it. I would argue the RPi handles those amounts of data no problem what so ever. Set one up for my son last summer for a school project. Exact same hardware as this HA RPi. And the test on the backed up statistics table kind of proves that, IMHO. I tried to find the actual SQL that is run here but have a hard time navigating the PRs and Github (use DevOps at work). Could you perhaps point me in the right direction?

werkstrom avatar Apr 30 '23 21:04 werkstrom

Its in this file https://github.com/home-assistant/core/blob/dev/homeassistant/components/recorder/migration.py and... https://github.com/home-assistant/core/blob/dev/homeassistant/components/recorder/queries.py But only a small part of it is written out as its mostly rendered by https://www.sqlalchemy.org/

bdraco avatar Apr 30 '23 21:04 bdraco

Should be said I also tried rebuilding the table (optimize in MariaDB I think, right). Did nothing what so ever (exept reclaiming some space)

werkstrom avatar Apr 30 '23 21:04 werkstrom

SHOW ENGINE INNODB STATUS will tell you exactly which query its waiting on.

It should look something like this if its healthy

=====================================
2023-04-30 16:37:55 0xffffa56c9050 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 7 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 0 srv_active, 0 srv_shutdown, 10957 srv_idle
srv_master_thread log flush and writes: 10957
----------
SEMAPHORES
----------
------------
TRANSACTIONS
------------
Trx id counter 923042
Purge done for trx's n:o < 923042 undo n:o < 0 state: running
History list length 12
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION (0xffff9b16c680), not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION (0xffff9b16ba80), not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION (0xffff9b16ae80), not started
0 lock struct(s), heap size 1128, 0 row lock(s)
--------
FILE I/O
--------
Pending flushes (fsync) log: 0; buffer pool: 0
3049 OS file reads, 5661 OS file writes, 4048 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.29 writes/s, 0.29 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 4685438343
Log flushed up to   4685438343
Pages flushed up to 4678898457
Last checkpoint at  4678898457
0 pending log flushes, 0 pending chkp writes
4628 log i/o's done, 0.29 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 150994944
Dictionary memory allocated 929064
Buffer pool size   8112
Free buffers       6090
Database pages     2022
Old database pages 759
Modified db pages  1222
Percent of dirty pages(LRU & free pages): 15.062
Max dirty pages percent: 90.000
Pending reads 0
Pending writes: LRU 0, flush list 0
Pages made young 11, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1542, created 474, written 0
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 2022, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 read views open inside InnoDB
Process ID=0, Main thread ID=0, state: sleeping
Number of rows inserted 20233, updated 3, deleted 0, read 157840
0.14 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.29 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

bdraco avatar Apr 30 '23 21:04 bdraco

Yeah, need to get HA up first.... Takes forever to restore... Getting a bit nervous... But. It's way past my bedtime so I'll have to check tomorrow. I'll get back to you on that. Thanks for assisting.

werkstrom avatar Apr 30 '23 21:04 werkstrom

FYI, there appear to be rows with only dates, only TS and then mixed... image

werkstrom avatar Apr 30 '23 21:04 werkstrom

So At it again:

  1. Did "OPTIMIZE" on all tables in the DB. Took quite some time for some of them.
  2. Restarted the HA hardware (RPi4)
  3. Turned off all NodeRED flows
  4. Upgraded
  5. No Locks so far. (4 hours in or so)
  6. No errors (relating to this) in the HA logg

However... Data is still corrupt, after 4 hours. And no queries are running so there should be plenty of time and resources to have this fixed. This is the state of statistics as of now.

image

werkstrom avatar May 01 '23 10:05 werkstrom

Also... What's with the hourly statistics drifting in time. These are updated 48 minutes to late... image

Something clearly happened here, as the time stamps shift column and it's about the time I believe I upgraded again: image Allready 3 minutes late, but that's no biggie. Then the next hour. It is 25 minutes late: image Then the last run is 48 min late. And we are no (as of writing) over an hour late on the next run. The time here is 12:06 UTC, so at 12.0? we should have gotten data for the hour 11-12. (Start time 11). We are still missing data for 10-11. The data in the screenshots are in UTC and fetched like so: image

UPDATE: Appears the recorder has stopped... image No errors in HA though. Clean log (except for "normal" errors on devices not found and such)

werkstrom avatar May 01 '23 12:05 werkstrom

Restarting the Pi gets recorder going again... No errors in the log that I can find. image Those are UTC time, the below image from within HA is in local (CET) time. image

werkstrom avatar May 01 '23 12:05 werkstrom

Maria DB is absolutely using the most CPU, but that is only because everythig else uses basically nothing: image image

werkstrom avatar May 01 '23 12:05 werkstrom

image

The recorder stops again (added a time stamp), nothing in the logs. home-assistant_2023-05-01T12-57-26.687Z.log

EDIT: It didn't stop. It ran, but 15-20 minutes late... 😲

werkstrom avatar May 01 '23 12:05 werkstrom

I can’t remember if I already offered this but if you want to send a database dump of the data pre migration I can try migrating it locally on my laptop and see if it runs into an issue

bdraco avatar May 01 '23 13:05 bdraco

Well, that was very kind. I would need to roll back once again, but thats perhaps the only way. I was thinking about moving the DB to another server just to se what happens.

werkstrom avatar May 01 '23 13:05 werkstrom

You could extract the old version by downloading the backup and pulling our just the MariaDB container backup

I could also try loading the current version as well and seeing if it will finish migrating locally as well.

bdraco avatar May 01 '23 13:05 bdraco

You could extract the old version by downloading the backup and pulling our just the MariaDB container backup

That could be a problem... Appears it's not there...

image

werkstrom avatar May 01 '23 14:05 werkstrom

If you only took a backup of core it won't be in there. You would have had to have made a full backup

bdraco avatar May 01 '23 15:05 bdraco

Also the specific migration query for Statistics is here

https://github.com/home-assistant/core/blob/adaae46178a1abcd0e01ca4432cf9f0f24b9b6fa/homeassistant/components/recorder/migration.py#L1320

It runs that update statement over and over again until no rows are changes. You could also try running it manually. Its very important that your session time is UTC or your will get the wrong results

bdraco avatar May 01 '23 15:05 bdraco

If you only took a backup of core it won't be in there. You would have had to have made a full backup

NodeRed is in there... And the other containers. Appears only MariaDB is missing... But. It says core, so you are most likely absolutely correct.

werkstrom avatar May 01 '23 15:05 werkstrom

Well, that went quick... 26 seconds and done. Thanks so much for assisting. Which tables should I run this i on? [statistics] ofcourse, but what other?

Unfortunatly, recorder is still not functioning correctly, but that should probably be another issue on its own. I'll register one.

Thanks again for helping. 🙏

werkstrom avatar May 01 '23 16:05 werkstrom