cds-dbs
[PARKED:] feat: edge database service
## Description
This PR implements a PoC for how the pluggable database services can be wrapped together into a multi-staged database service. This implementation uses an `SQLiteService` as an in-memory `edge` database and a `PostgresService` as `root` database. This is a very basic scenario, but also about as complex as most edge database implementations get. With more database services implemented it would be possible to tier the data in more complex configurations, going from `edge` over `hot` and `warm` to `cold`, or potentially cross-region using the same database service type for different levels, ordered by latency from the service to the database instances.
## Reading
The main purpose of the `edge` database is to act as a feature-complete cache. Incoming read requests are mostly handled by the `edge` database; when the results are not known to the `edge` database, the query is passed on to the next database service.
```mermaid
sequenceDiagram
  participant Client
  participant CAP
  participant SQLite
  participant Postgres
  Client->>CAP: Books?$top=100
  CAP->>SQLite: SELECT Books 100
  SQLite-->>CAP: returns 100 Books
  CAP-->>Client: return
  Client->>CAP: Books?$skip=1000
  CAP->>SQLite: SELECT Books SKIP 1000
  SQLite-->>CAP: returns 0 Books
  CAP->>Postgres: SELECT Books SKIP 1000
  Postgres-->>CAP: returns 100 Books
  CAP-->>Client: return
```
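The read fall-through above could be sketched like this, with plain stub objects standing in for `SQLiteService` and `PostgresService` (the `EdgeDatabaseService` class and its `run`/`read` methods are illustrative assumptions, not the actual cds-dbs interfaces):

```javascript
// Sketch of the read fall-through. EdgeDatabaseService and the stub
// tiers are illustrative stand-ins for SQLiteService / PostgresService,
// not the actual cds-dbs interfaces.
class EdgeDatabaseService {
  constructor(edge, root) {
    this.edge = edge // in-memory cache tier
    this.root = root // source of truth
  }

  async read(query) {
    const rows = await this.edge.run(query) // try the edge first
    if (rows.length > 0) return rows
    return this.root.run(query) // unknown to the edge -> ask the root
  }
}

// Stub tiers backed by plain arrays; `query` is a filter predicate here
const makeTier = (rows) => ({ run: async (query) => rows.filter(query) })

const edge = makeTier([{ ID: 1, title: 'Wuthering Heights' }])
const root = makeTier([
  { ID: 1, title: 'Wuthering Heights' },
  { ID: 2, title: 'Jane Eyre' },
])
const db = new EdgeDatabaseService(edge, root)

db.read((b) => b.ID === 2).then((rows) => console.log(rows[0].title)) // Jane Eyre (served by root)
```

A real implementation would of course pass CQN queries through instead of predicates, but the delegation shape stays the same.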
## Writing

Writes are executed on all databases. This allows the `edge` database to stay up to date without a separate synchronization step, and it also enforces data consistency, as the `root` database has to successfully execute the query as well.
```mermaid
sequenceDiagram
  participant Client
  participant CAP
  participant SQLite
  participant Postgres
  Client->>CAP: Books(201)
  CAP->>SQLite: UPDATE Books ID=201
  CAP->>Postgres: UPDATE Books ID=201
  SQLite-->>CAP: returns 1
  Postgres-->>CAP: returns 1
  CAP->>Postgres: COMMIT
  Postgres-->>CAP: success
  CAP->>SQLite: COMMIT
  SQLite-->>CAP: success
  CAP-->>Client: return
```
In case of any error on the `root` database, all transactions are rolled back. In case of an error in the `edge` database, it is possible to abort the transaction and reconstruct the `edge` database from the `root` database state.
```mermaid
sequenceDiagram
  participant Client
  participant CAP
  participant SQLite
  participant Postgres
  Client->>CAP: Books(201)
  CAP->>SQLite: UPDATE Books ID=201
  CAP->>Postgres: UPDATE Books ID=201
  SQLite-->>CAP: returns 1
  Postgres-->>CAP: returns 1
  CAP->>Postgres: COMMIT
  Postgres-->>CAP: error
  CAP->>Postgres: ROLLBACK
  CAP->>SQLite: ROLLBACK
  SQLite-->>CAP: success
  Postgres-->>CAP: success
  CAP-->>Client: return
```
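A rough sketch of this commit ordering and rollback, again using hypothetical stub transactions (the `begin`/`run`/`commit`/`rollback` shape is an assumption for illustration, not the real cds-dbs transaction API):

```javascript
// Sketch of the write path: every statement is executed on both tiers,
// the root commits first, and any error rolls back both transactions.
// The service shape (begin/run/commit/rollback) is an assumption for
// illustration, not the real cds-dbs transaction API.
class EdgeWriteCoordinator {
  constructor(edge, root) {
    this.edge = edge // fast cache tier
    this.root = root // source of truth, must succeed for consistency
  }

  async write(statement) {
    const edgeTx = await this.edge.begin()
    const rootTx = await this.root.begin()
    try {
      await edgeTx.run(statement)
      await rootTx.run(statement)
      await rootTx.commit() // root first: it enforces consistency
      await edgeTx.commit()
      return 'committed'
    } catch (err) {
      // On any error both tiers roll back; alternatively the edge
      // could be reconstructed from the root state.
      await rootTx.rollback()
      await edgeTx.rollback()
      return 'rolled back'
    }
  }
}

// Stub tier: optionally fails on commit to simulate a root error
const makeTier = (failOnCommit = false) => ({
  begin: async () => ({
    run: async () => 1,
    commit: async () => {
      if (failOnCommit) throw new Error('commit failed')
    },
    rollback: async () => {},
  }),
})

new EdgeWriteCoordinator(makeTier(), makeTier(true))
  .write('UPDATE Books SET title=? WHERE ID=201')
  .then(console.log) // rolled back (root commit failed)
```

Committing the root before the edge is the important design choice here: if the root rejects the change, the edge never exposes data the root does not have.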
## Data life cycle

There is the general concept of data life cycle definitions, which allow a system to move data through different stages. Generally the stages are called `hot`, `warm` and `cold`.
### hot

Data is `hot` when it is regularly read or modified. It is important to keep `hot` data easily accessible to maintain fast query processing. In the case of an `edge` database it might not be feasible to keep all `hot` data on the edge, as the life span of `hot` data is generally on the scale of months, which becomes a problem when a million rows are generated a day. An additional `edge` life cycle would therefore identify all the data that will be read on a daily time frame.

Storages: `HANA`, `SQLite`, `Postgres`
### warm

With modern databases relying more on memory stores it has become possible to define `warm` data, which mostly means that the data will still be read on a weekly to monthly basis. The primary method of storing `warm` data is on disk, reducing storage costs in exchange for increased processing time when the data is read.

Storages: `HANA NSE`, `SQLite disk`, `Postgres`
### cold

Most applications won't have `cold` data. This mostly applies to applications which need to be compliant with auditing laws (e.g. financial, medical or legal applications). The most common example is the quarterly or yearly financial report, which is based upon large datasets that are written once and then read once in several months.

Storages: `SPARK`, `HADOOP`, `CSV`
## Data life cycle management
The most common method of defining the life cycle of data is to use Date/Time columns.

- `edge`: `modifiedAt > add_day($now, -1)` -> modified within the last day
- `hot`: `createdAt > add_month($now, -1)` -> created within the last month
- `warm`: `createdAt > add_year($now, -1)` -> created within the last year
- `cold`: `createdAt < add_year($now, -1)` -> created before the last year
This definition will keep recently updated data in the `edge`, while gradually moving data down as time goes on. To move the data, a regularly scheduled task delegates to the databases to move the relevant data out. For the `edge` database this task might run every hour, while for the lower layers it might run on a weekly or monthly basis. Usually the life cycle is linked to partitioning, as this allows the database to optimize query execution. The same optimization can be applied to the edge service.
For example, the following queries only need to be executed in a single database:

```sql
SELECT * FROM Notifications WHERE modifiedAt > add_hour($now, -1); -- all data is in the edge
SELECT * FROM Billings WHERE createdAt < add_month($now, -1); -- all data is in warm storage
SELECT * FROM Billings WHERE createdAt < add_year($now, -2); -- all data is in cold storage
```
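As a sketch, the life-cycle definition above can be expressed as a small classifier that maps a row to the single tier holding it (`tierOf` is a hypothetical helper for illustration, not a cds-dbs API):

```javascript
// Sketch: classify a row into its life-cycle tier following the
// definition above. tierOf is a hypothetical helper, not a cds-dbs API;
// timestamps are plain epoch milliseconds.
const DAY = 24 * 60 * 60 * 1000

function tierOf(row, now = Date.now()) {
  if (now - row.modifiedAt <= 1 * DAY) return 'edge'  // modified within the last day
  if (now - row.createdAt <= 30 * DAY) return 'hot'   // created within the last month
  if (now - row.createdAt <= 365 * DAY) return 'warm' // created within the last year
  return 'cold'                                       // created before the last year
}

const now = Date.now()
console.log(tierOf({ modifiedAt: now, createdAt: now - 400 * DAY }, now)) // edge
console.log(tierOf({ modifiedAt: now - 40 * DAY, createdAt: now - 40 * DAY }, now)) // warm
```

The scheduled mover task and the query router would both evaluate these predicates: the mover to relocate rows whose tier changed, the router to pick the one database that can fully answer a query.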
## High availability

If there is an issue with the `root` database it is possible to switch the service into read-only mode and serve requests from the `edge` database only, until the `root` database is available again and the service can switch back to normal operations.
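A minimal sketch of this degradation, assuming hypothetical stub services with a `run` method (not the actual cds-dbs API):

```javascript
// Sketch of the read-only fallback: when a write against the root fails,
// the service degrades to read-only and keeps serving reads from the edge.
// FailoverService and the `run` method are illustrative, not cds-dbs APIs.
class FailoverService {
  constructor(edge, root) {
    this.edge = edge
    this.root = root
    this.readOnly = false
  }

  async write(statement) {
    if (this.readOnly) throw new Error('service is in read-only mode')
    try {
      return await this.root.run(statement)
    } catch (err) {
      this.readOnly = true // root unavailable -> degrade to read-only
      throw new Error('service is in read-only mode')
    }
  }

  async read(query) {
    return this.edge.run(query) // reads stay answerable from the edge
  }

  recover() {
    this.readOnly = false // root is back -> resume normal operations
  }
}
```

In practice `recover()` would be driven by a health check against the `root` database rather than called manually.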
If the `edge` database is configured properly and cannot de-synchronize from the `root` database, it could even continue as the `root` of its designated data slice and later synchronize down to the `root` database once it is available again. As this requires extraordinary confidence in the configuration of the `edge` database, it is not recommended.