
Logbook storage

Open balloob opened this issue 5 years ago • 18 comments

On large datasets, our logbook sucks. Loading takes a lot of time, filtering takes a lot of time and any result set larger than a day just dies. Can we do better? Yes, we can!

In fact, we already have a mechanism to do better, just not so standardized: the logbook_entry event.

All firings of this event are shown in the logbook. This is also the event that is fired when you call the logbook.log service or the async_log_entry method in the logbook component (used by the automation component).

Here is my proposal: we're going ALL IN on this. And with ALL IN, I mean, the logbook is going to show only the logbook_entry events.

We're going to do this by doing the work of creating logbook entries as the events are happening. Since we no longer do any work on the fly when fetching the results, we get a few advantages:

  • We can easily link multiple events together using context. That way we can say control was done via Alexa etc.
  • We can start including the message translation key and placeholders so we can work with internationalization in the frontend
  • We can stream the logbook in a Lovelace card or logbook panel.
  • We can start storing other relevant information with the logbook entry, like the entity registry ID (in case the entity ID is changed later) and, in the future, the current area ID.

Downside:

  • Storing more events
  • We might be creating logbook entries for events never seen by a user

Storage:

Let's think a bit ahead and future-proof this system. We need to be able to filter the logbook by

  • context id
  • user
  • area (soon)
  • device
  • entity id
  • entity domain
  • time it happened
  • refresh token used to trigger the command
  • How it was triggered (Automation, Google, Alexa)

So we should store these events inside their own table, just like we do with states. They can have a foreign key pointing at the originating event.
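For illustration, a rough sketch of what such a table could look like, covering the filter columns listed above (every name and column choice here is hypothetical, not a finalized schema):

```sql
-- Hypothetical schema sketch: one row per logbook_entry event,
-- written when the event fires so reads need no post-processing.
CREATE TABLE logbook_entries (
    entry_id         INTEGER PRIMARY KEY,
    event_id         INTEGER REFERENCES events (event_id),
    time_fired       TIMESTAMP NOT NULL,
    context_id       VARCHAR(36),
    context_user_id  VARCHAR(36),
    refresh_token_id VARCHAR(36),
    entity_id        VARCHAR(255),
    domain           VARCHAR(64),
    device_id        VARCHAR(36),
    area_id          VARCHAR(36),
    source           VARCHAR(32),  -- how it was triggered: automation, Google, Alexa, ...
    message          TEXT
);

-- Indexes matching the most common logbook queries
CREATE INDEX ix_logbook_time_fired  ON logbook_entries (time_fired);
CREATE INDEX ix_logbook_entity_time ON logbook_entries (entity_id, time_fired);
CREATE INDEX ix_logbook_context_id  ON logbook_entries (context_id);
```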

About related entities:

I think that it makes sense to have two types of logbook entries: primary and secondary. That way we don't have to delay creating a logbook entry until the script has finished; we can keep creating entries that reference a "parent logbook entry". We should do this by relying on context: the first event with a given context is primary, all subsequent ones are secondary.
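A sketch of how that rule could be expressed at read time, assuming a table along the lines of the illustrative logbook_entries sketch above (window functions need PostgreSQL or SQLite 3.25+):

```sql
-- Illustrative read-side query: rank entries within each context by time;
-- the earliest entry is the primary one, everything after it is secondary.
SELECT
    entry_id,
    context_id,
    time_fired,
    ROW_NUMBER() OVER (
        PARTITION BY context_id
        ORDER BY time_fired, entry_id
    ) = 1 AS is_primary
FROM logbook_entries
ORDER BY time_fired;
```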

balloob avatar Nov 21 '18 10:11 balloob

I don't think it's a great idea to do this on the premise of performance. The real underlying problem is just that our recorder is really slow, and if we try and fix it by pushing more into the recorder, we're going to exacerbate the existing performance problems for everything else that uses the recorder.

My preference would be to first tackle normalizing our existing recorder schema, splitting state attributes to tables where we can so they can be indexed and queried. We're eventually going to have to address the underlying performance issues of the recorder either way, and IMO we should do it before committing to additional de-normalized storage strategies. It might remove the need to support the additional complexity altogether.

emlove avatar Nov 21 '18 15:11 emlove

I think we should put the history log and the current state into different databases (or tables). This might make startup faster.

fanthos avatar Nov 22 '18 08:11 fanthos

I think normalizing the database would be a mistake.

Puts on his data warehouse hat.

If we normalize the data, we're going to lose important historical data. For example, if the area of a device is changed, in a normalized table, now all old records are going to show that the device was in the new area.

If we want to keep historically relevant information and split it into multiple tables, we should leverage a star schema. Area would then be a dimension, and whenever any attribute of an area changes, we insert a new row into the dimension. The area ID is no longer the primary key but just one of the attributes; the fact table references the area by a surrogate key, which uniquely identifies each iteration of an area.
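For concreteness, a minimal sketch of what such a dimension could look like (all table and column names here are illustrative, not a proposed Home Assistant schema):

```sql
-- Illustrative star-schema sketch: a new dim_area row is inserted whenever
-- an area changes, so old facts keep pointing at the version they were
-- recorded against instead of silently moving to the new area.
CREATE TABLE dim_area (
    area_key   INTEGER PRIMARY KEY,   -- surrogate key, one per version of an area
    area_id    VARCHAR(36)  NOT NULL, -- the natural area id, repeated across versions
    name       VARCHAR(255) NOT NULL,
    valid_from TIMESTAMP    NOT NULL,
    valid_to   TIMESTAMP              -- NULL while this version is current
);

CREATE TABLE fact_logbook (
    entry_id   INTEGER PRIMARY KEY,
    time_fired TIMESTAMP NOT NULL,
    message    TEXT,
    area_key   INTEGER REFERENCES dim_area (area_key)  -- version valid at event time
);
```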

But I think that we should not aim to build a full data warehouse into our recorder.

If we normalize the data, performance is going to suffer, because now we need to join tables. If there is a single table with logbook entries, with all historically relevant information embedded, we query one table, leverage one relevant index, and it's fast. More data to store, faster to retrieve.

The current challenge with our logbook is that we fetch all events into memory, then process them to see which are relevant, then convert them to JSON, then send them off to the user. With the proposed approach, the processing has already happened, and we can target a specific table with logbook entries containing indices optimized for our logbook queries.

So instead of fetching 20k events, converting to SQLAlchemy objects, filtering and processing as needed, we fetch 1k events and send as JSON.

balloob avatar Nov 22 '18 12:11 balloob

But I think that we should not aim to build a full data warehouse into our recorder.

Care to elaborate on this? I'm not sure I understand why we wouldn't want our recorder to be the best way to store Home Assistant history that we can confidently recommend to users. Right now it's just not possible to use the recorder for long-term storage, and duplicating even more data is going to make it worse.

If we normalize the data, performance is going to suffer, because now we need to join tables. If there is a single table with logbook entries, with all historically relevant information embedded, we query one table, leverage one relevant index, and it's fast. More data to store, faster to retrieve.

So every time we need a new view into history, is the plan to add a new table that contains the denormalized data we need for that query, write a migration to backfill the table, and just keep adding tables?

You keep saying JOIN like it's a curse word. Indexed joins are fast and efficient. With the right compound index it's just as fast as a normal select. We can build specific compound indexes for each query we need to perform, which gives the same performance benefit as splitting into one table per query but only requires additional storage for the indexed keys. Not all of the data needs to be duplicated.
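As a concrete illustration (column names from the current schema, index name and query values hypothetical), a compound index built for the typical "history of one entity over a time range" lookup lets that whole WHERE clause be answered from one index:

```sql
-- Compound index tailored to the "history of one entity over a time range" query
CREATE INDEX ix_states_entity_id_last_updated
    ON states (entity_id, last_updated);

-- This query can then be answered with a single range scan of that index:
SELECT state_id, state, last_updated
FROM states
WHERE entity_id = 'light.kitchen'            -- hypothetical entity
  AND last_updated >= '2018-11-21 00:00:00'
  AND last_updated <  '2018-11-22 00:00:00';
```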

IMO the star schema we're discussing here would be a much better change. There's no need to normalize anything that would cause us to lose historical data, but exposing state details so they're queryable means we can actually build the indexes we need for efficient lookup.

emlove avatar Nov 22 '18 14:11 emlove

Alternatively, if we really want to store completely de-normalized results for each query we support, we should probably not try and fit it into a relational database, and look for a storage solution more optimized for that case.

emlove avatar Nov 22 '18 16:11 emlove

As with every data storage decision, we are either optimizing for the reader or the writer of the data.

My proposal was about optimizing for the logbook reader: querying a single table and hitting a single index is as fast as you can get it. JOINs can get expensive when joining multiple tables and filtering on a bunch of columns. Big database engines are good at JOINs and indices, but we're running SQLite, on a Pi. (I've been trying to find benchmarks but have sadly been unable to find any, so I can't back this point up.)

About the current table structure

Home Assistant was originally an event-sourced system, meaning that every change is recorded via an event. By replaying all events up to a point in time, you could determine the exact state at that moment; however, the state itself was never stored. Keeping the events instead of just the state is very important for machine learning: we need to know all the events that led up to something happening to be able to find causality.

(Sidenote: over time we have strayed away from being completely event sourced by storing a bunch of metadata outside of the state/event/services core, and thus don't announce full change sets on the event bus: entity registry, device registry, auth.)

Because of event sourcing we store the events in the events table instead of just maintaining the last state of an entity in a table. Since events are very bare bones, a type and unstructured data, there is not much to normalize, unless you want to go super all in with a (fired, type) table and then either a column for each piece of data that needs to be stored for any event type, or an extra table with columns for each event type. Neither seems feasible, as we don't have a fixed list of events to begin with.

However, for the history component we actually wanted to filter out certain state_changed events based on attributes, so we broke that event out into its own table. The reason it got its own table is that we wanted to check two things (roughly the query sketched after this list):

  • filter by entity_id
  • only get the states where the state changed (and not one of its attributes), checked by making sure last_changed == last_updated
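In SQL terms the filter looks roughly like this (the entity value is just an example):

```sql
-- Only rows where the state itself changed (not just an attribute),
-- for one entity, in chronological order; the entity id is hypothetical.
SELECT state_id, entity_id, state, last_updated
FROM states
WHERE entity_id = 'light.kitchen'
  AND last_changed = last_updated
ORDER BY last_updated;
```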

And now for the logbook, my proposal is to add another table that contains more information for the logbook_entry event, with again as goal that a component can just target their own table.

I think that a table per component is more in the spirit of how the rest of Home Assistant is structured, independent silos whose decisions don't impact one another. However, I am also open for new ideas. So when you talk about normalization, what exactly do you have in mind? Would you want to use 3NF to structure the states table? Or also include the events table?

balloob avatar Nov 22 '18 19:11 balloob

As we're discussing this, I'm thinking we should consider NoSQL document store options such as MongoDB, etc. It seems like we have a use-case that's exactly suited for those types of scenarios. We have a flexible schema that's very difficult to express in a relational database, and we basically don't need any of the advantages of a relational database. We're never actually trying to select across multiple logical tables, we're only looking up historical data.

I'd personally think it's worthwhile to explore that option long-term. It would probably let us step back to a single events table, since all the state change data would now be queryable, and we can add specific indices on any of the nested data. That being said, someone would have to build it and be willing to throw it out if the benchmarks or code didn't turn out like we hoped.

That being said, I'd also support this original logbook table proposal as a stopgap to mitigate the current problems. A new table would at least buy us more time, and doesn't have much downside if we decide to remove it in the future. I know I don't have time right now to explore any of my lofty complete rewrite ideas, and I wouldn't want to block a fix for the present.

emlove avatar Nov 23 '18 15:11 emlove

@armills yesterday on discord we had a general consensus towards InfluxDB.

The issue is, it is not appropriate to be running on an rpi with an SD card. :(

dgomes avatar Nov 23 '18 15:11 dgomes

The issue is, it is not appropriate to be running on an rpi with an SD card. :(

I ran InfluxDB on a Pi (with MySQL also on the same Pi) for almost 2 years with no issues. Though I also eliminated chatty sensors that throw state changes around every couple seconds or every minute

cgarwood avatar Nov 23 '18 17:11 cgarwood

my InfluxDB is currently using 250 MB (25% of an RPi 3); try cold rebooting Influx on an SD card ;)

dgomes avatar Nov 23 '18 20:11 dgomes

Though I also eliminated chatty sensors that throw state changes around every couple seconds or every minute

InfluxDB's purpose is exactly to store such data 😉

balloob avatar Nov 24 '18 09:11 balloob

I went down a bit of a rabbit hole today with a notion of speeding up the history loading, posted in #devs_backend on discord about it, and was linked here.

I don't think aggressive normalising is a good idea for the various reasons already mentioned. My initial reaction when I saw the schema was to partition the events table by time, which I'll describe below.

I've done this before with large amounts of time-series data in Postgres and it's worked very well. (IRCCloud used partitioned Postgres successfully before switching to Cassandra, for tables with more than 10 billion rows.)

Introducing an LSM-style database like influx/rocks/leveldb would not be unreasonable in a server-side environment, but I'd be wary of doing this for home assistant because of the increased operational complexity and associated support pain. One SQL db is much easier to manage and reason about.

Consider if you can live without an ORM, and dropping support for most DB backends

In my opinion, using an ORM like SQLAlchemy and supporting various DB backends (pg, mysql, mssql, etc.) has become a drag and is severely restricting what database optimisations you can do.

How partitioning works if done by hand

Were you to implement partitioning by hand, you'd pick a time interval, let's say 1 day, and have an events table per interval. eg: events_20180226, events_20180227, etc. (bear with me..)

Now when inserting an event, insert to the correct table by checking the day of the time_fired timestamp and picking the table accordingly.

When querying for historical event data, ensure you only query the appropriate sub tables for the date range in question, and combine the results. "Last 10 events" would probably only hit today's table. "Last 24hrs of events" would probably span today and yesterday's tables, and so on.

When purging old data (purge_keep_days) you just DROP the relevant old tables, which is an O(1) operation, unlike selective DELETE from a giant single table, which takes forever.

The indexes for the most recent daily table(s) are relatively small, and tend to be in memory, so access to recent history data will be much faster, and going back a few days will only hit a limited amount of child tables, instead of traversing a one-giant-table sized index.

Partitioning is easy in postgresql..

Implementing this in Postgresql got even easier recently in postgres 10. Your events table is technically empty, and you make child tables like events_20180226 with CREATE TABLE events_20180226 ... INHERITS(events) then add a CHECK CONSTRAINT which enforces the event timestamp falls on the day from the table name. In reality I would change events.event_id from a sequence to a unix timestamp of the event, and use that as the primary key and the check constraint.

Inserting into events will automatically insert into the correct child table behind the scenes (based on the check constraints), and when you query from events, the query planner knows to only query the child tables that hold data for the date range you are selecting.

This makes everything really fast because you tend to only be operating on one or two small child tables and their associated indexes.
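As a sketch of what this looks like with the declarative partitioning added in PostgreSQL 10 (the newer equivalent of the INHERITS plus CHECK constraint setup described above; names and dates are illustrative and the column list is trimmed down):

```sql
-- The parent table holds no rows itself; data lives in the daily partitions.
CREATE TABLE events (
    event_id   BIGINT NOT NULL,
    event_type VARCHAR(32),
    event_data TEXT,
    time_fired TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (time_fired);

-- One child table per day; the planner only scans partitions whose
-- range overlaps the queried time window.
CREATE TABLE events_20180226 PARTITION OF events
    FOR VALUES FROM ('2018-02-26') TO ('2018-02-27');
CREATE TABLE events_20180227 PARTITION OF events
    FOR VALUES FROM ('2018-02-27') TO ('2018-02-28');

-- Inserts against the parent are routed to the right partition automatically.
INSERT INTO events (event_id, event_type, event_data, time_fired)
VALUES (1, 'state_changed', '{"entity_id": "light.kitchen"}', '2018-02-26 12:00:00+00');

-- Purging a day is a cheap metadata operation instead of a giant DELETE.
DROP TABLE events_20180226;
```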

Other benefits of just using postgres and ditching the ORM

Postgres has pretty neat json type support, so you could still use the same events schema, but change the event_data to a json type, which would allow functional indexes. You could create an index on the entity_id field from within the json document, for example, without duplicating that field into its own column.
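A sketch of what that could look like, assuming event_data were migrated to the jsonb type (index name and query values are illustrative):

```sql
-- Switch the payload column from text to jsonb
ALTER TABLE events
    ALTER COLUMN event_data TYPE jsonb USING event_data::jsonb;

-- Functional index on the entity_id embedded in the JSON document,
-- no duplicated column required
CREATE INDEX ix_events_data_entity_id
    ON events ((event_data ->> 'entity_id'));

-- Queries filtering on the embedded field can now use that index
SELECT event_id, time_fired
FROM events
WHERE event_data ->> 'entity_id' = 'light.kitchen';
```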

Summary

I think partitioning the events table by time will solve pretty much all the scalability issues you currently have with historical event data. I am not familiar enough with hass internals or the direction things are headed to confidently suggest any other schema changes for now.

The best way to do that, IMO, is to only support one database type: Postgres. Ditch SQLAlchemy to allow you to use Postgres-specific features like table partitioning and JSON document support.

Given hassio's increase in popularity, I think it would be reasonable to pick a single database. Anyone using the hassio distribution won't know or care, because it will be preconfigured.

RJ

RJ avatar Feb 26 '19 12:02 RJ

Just a side note. It should really be using a database optimized for time series (InfluxDB is one example). They are designed around time-based queries and can downsample result sets at query time to a limited resolution (e.g. one sample per month).

But yea, one database supported would IMHO also make more sense.

elupus avatar Feb 26 '19 13:02 elupus

Just a side note as well: InfluxDB 2.0 (currently alpha) will ship as a single binary containing the full stack (including Chronograf, etc.), which might be a great option for this case (easy setup, cross-platform).

frenck avatar Feb 26 '19 14:02 frenck

My proposal was about optimizing for the logbook reader: querying a single table and hitting a single index is as fast as you can get it. JOINs can get expensive when joining multiple tables and filtering on a bunch of columns. Big database engines are good at JOINs and indices, but we're running SQLite, on a Pi. (I've been trying to find benchmarks but have sadly been unable to find any, so I can't back this point up.)

To counter this argument (theoretically, I haven't benchmarked): a Pi has limited bandwidth to the SD card, but because it's flash storage, seek times should be very low. So loading less data and joining it could actually be beneficial.

That having been said, I'm currently running it on an old i7 2600K system (without an SSD) and I don't have any spare Pis to use for benchmarking, so I can't easily test the exact case. The performance is still abominable though...

With the number of indexed columns I would expect the insert/update performance to drop quite a bit as well, especially since a few of these columns benefit very little from being denormalized.

wolph avatar Jul 27 '19 15:07 wolph

I decided to do a little bit more analysis to make the current state a bit more apparent.

When looking at the number of unique values, I would argue that entity_id at the very least is a prime suspect for normalization, because it is indexed as well.

Looking at the number of indices, I wonder: are all of these indices actually used? If not, they will negatively impact the insert/update performance for the tables as well.

Anyhow, seeing as we can expect low-end systems to run this, I would guess it's still beneficial to normalize at least some of the large, low-cardinality columns regardless of indices. Perhaps even event_data would be a contender, since it's exceptionally large, perhaps combined with an indexed hash(event_data) to make sure we don't blow up the index size for that column.
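To make that last idea concrete, one possible (purely illustrative) shape: pull the large, frequently repeated payloads into their own table, deduplicated via a hash, so events only stores a small reference and the index covers a fixed-size hash instead of the full text (Postgres shown, since the queries below are Postgres-only anyway):

```sql
-- Purely illustrative: move large, repetitive payloads into their own table,
-- deduplicated via a hash, and keep only a small integer reference in events.
CREATE TABLE shared_event_data (
    data_id   INTEGER  PRIMARY KEY,
    data_hash CHAR(32) NOT NULL,  -- e.g. md5 of the payload text
    data      TEXT     NOT NULL
);
CREATE INDEX ix_shared_event_data_hash ON shared_event_data (data_hash);

-- Before inserting an event, look up whether this payload already exists
-- (the hash keeps the index small; the data comparison guards against collisions).
SELECT data_id
FROM shared_event_data
WHERE data_hash = md5('{"entity_id": "light.kitchen"}')
  AND data = '{"entity_id": "light.kitchen"}';
```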

events

| column_name | index | unique_percentage | unique_values | data_type | max_length | total_size | average_size | percentage_of_table |
|---|---|---|---|---|---|---|---|---|
| event_id | TRUE | 100% | 5371550 | integer | | 20 MB | 4 | 1% |
| event_type | TRUE | 0% | 28 | character varying | 32 | 67 MB | 13.13380998 | 3% |
| event_data | FALSE | 33% | 1788243 | text | | 1311 MB | 255.918224 | 68% |
| origin | FALSE | 0% | 1 | character varying | 32 | 31 MB | 6 | 2% |
| time_fired | TRUE | 100% | 5371550 | timestamp | | 41 MB | 8 | 2% |
| created | FALSE | 100% | 5371550 | timestamp | | 41 MB | 8 | 2% |
| context_id | TRUE | 100% | 5352806 | character varying | 36 | 169 MB | 33 | 9% |
| context_user_id | TRUE | 0% | 3 | character varying | 36 | 11 kB | 33 | 0% |

states

| column_name | index | unique_percentage | unique_values | data_type | max_length | total_size | average_size | percentage_of_table |
|---|---|---|---|---|---|---|---|---|
| state_id | TRUE | 100% | 525798 | integer | | 2054 kB | 4 | 1% |
| domain | FALSE | 0% | 22 | character varying | 64 | 3718 kB | 7.241706891 | 2% |
| entity_id | TRUE | 0% | 595 | character varying | 255 | 15 MB | 30.11909326 | 9% |
| state | FALSE | 25% | 132173 | character varying | 255 | 3150 kB | 6.134924439 | 2% |
| attributes | FALSE | 17% | 90165 | text | | 92 MB | 184.1937854 | 53% |
| event_id | TRUE | 100% | 525798 | integer | | 2054 kB | 4 | 1% |
| last_changed | FALSE | 83% | 438738 | timestamp | | 4108 kB | 8 | 2% |
| last_updated | TRUE | 100% | 525798 | timestamp | | 4108 kB | 8 | 2% |
| created | FALSE | 100% | 525798 | timestamp | | 4108 kB | 8 | 2% |
| context_id | TRUE | 99% | 521056 | character varying | 36 | 17 MB | 33 | 10% |
| context_user_id | TRUE | 0% | 1 | character varying | 36 | 3762 bytes | 33 | 0% |

Where the (Postgres only) queries are generated by:

SELECT
'select
''' || c."table_name" || ''' AS table_name,
' || c."ordinal_position" || ' AS "#",
''' || c."column_name" || ''' AS column_name,
''' || c."data_type" || ''' AS data_type,
''' || COALESCE(c."character_maximum_length", -1) || ''' AS max_length,
 pg_size_pretty(sum(pg_column_size(' || c."column_name" || '))) as total_size,
 avg(pg_column_size(' || c."column_name" || ')) as average_size,
 sum(pg_column_size(' || c."column_name" || ')) * 1.0 / pg_relation_size(''' || c."table_name" || ''') as percentage_of_table,
 -- SUM(CASE WHEN ' || c."column_name" || ' IS NULL THEN 0 ELSE 1 END) AS non_null_values,
 -- SUM(CASE WHEN ' || c."column_name" || ' IS NULL THEN 0 ELSE 1 END) * 1.0 / COUNT(*) AS non_null_percentage,
 -- SUM(CASE WHEN ' || c."column_name" || ' IS NULL THEN 1 ELSE 0 END) AS null_values,
 -- SUM(CASE WHEN ' || c."column_name" || ' IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS null_percentage,
 COUNT(DISTINCT ' || c."column_name" || ') AS unique_values,
 COUNT(DISTINCT ' || c."column_name" || ') * 1.0 / COUNT(*) AS unique_percentage
from ' || c."table_name" || '
UNION ALL
'
FROM
	information_schema.COLUMNS C 
WHERE
	TABLE_NAME IN ('states', 'events')
	ORDER BY table_name, ordinal_position

wolph avatar Jul 27 '19 21:07 wolph

Would it be possible just to use influxdb for all history graphs? The data is already written there anyway....

runningman84 avatar Apr 09 '20 21:04 runningman84

Would it be possible just to use influxdb for all history graphs? The data is already written there anyway....

That would require either:

  • maintaining both the regular and the InfluxDB system
  • forcing everyone to use InfluxDB as well

I'm not sure the team is looking for the maintenance burden of keeping two systems going. And with low-end systems such as the Raspberry Pi, I don't think requiring InfluxDB is reasonable.

Just my two cents though :)

wolph avatar Apr 09 '20 21:04 wolph

This architecture issue is old, stale, and possibly obsolete. Things have changed a lot over the years. Additionally, we have been moving this kind of architectural discussion over to discussions.

For that reason, I'm going to close this issue.

../Frenck

frenck avatar May 11 '23 14:05 frenck