
[PARKED:] feat: edge database service

Open · BobdenOs opened this issue 11 months ago · 1 comment

Description

This PR implements a PoC for how the pluggable database services can be wrapped together into a multi-stage database service.

This implementation uses an SQLiteService as an in-memory edge database and a PostgresService as the root database. This is a very basic scenario, but it is also about as complex as most edge database implementations get. With more database services implemented, it would be possible to tier the data in more complex configurations, going from edge through hot and warm to cold. Or potentially cross-region, using the same database service type for different levels, ordered by the latency from the service to the database instances.

Reading

The main purpose of the edge database is to act as a feature-complete cache. So when read requests come in, they will mostly be handled by the edge database. When the results are not known to the edge database, the query is passed on to the next database service.

sequenceDiagram
participant Client
participant CAP
participant SQLite
participant Postgres

Client->>CAP: Books?$top=100
CAP->>SQLite: SELECT Books 100
SQLite-->>CAP: returns 100 Books
CAP-->>Client: return

Client->>CAP: Books?$skip=1000
CAP->>SQLite: SELECT Books SKIP 1000
SQLite-->>CAP: returns 0 Books
CAP->>Postgres: SELECT Books SKIP 1000
Postgres-->>CAP: returns 100 Books
CAP-->>Client: return
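
In code, the read delegation could look roughly like the following sketch. All names here (EdgeDatabaseService, DatabaseService and the empty-result check) are illustrative assumptions rather than the actual PoC API; in particular, a real implementation would need a more precise way to decide whether the edge result is complete than checking for an empty result.

type Row = Record<string, unknown>

// Minimal stand-in for a pluggable database service (assumption).
interface DatabaseService {
  run(query: string): Promise<Row[]>
}

class EdgeDatabaseService {
  constructor(
    private edge: DatabaseService, // e.g. in-memory SQLiteService
    private root: DatabaseService, // e.g. PostgresService
  ) {}

  async read(query: string): Promise<Row[]> {
    // The edge acts as a feature-complete cache, so try it first.
    const rows = await this.edge.run(query)
    if (rows.length > 0) return rows
    // The edge does not know the result; pass the query on to the next service.
    return this.root.run(query)
  }
}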

Writing

Writes go to all databases. This allows the edge to stay up to date without a separate synchronization step, and it also enforces data consistency, as the root database has to successfully execute the query as well.

sequenceDiagram
participant Client
participant CAP
participant SQLite
participant Postgres

Client->>CAP: Books(201)
CAP->>SQLite: UPDATE Books ID=201
CAP->>Postgres: UPDATE Books ID=201
SQLite-->>CAP: returns 1
Postgres-->>CAP: returns 1
CAP->>Postgres: COMMIT
Postgres-->>CAP: success
CAP->>SQLite: COMMIT
SQLite-->>CAP: success
CAP-->>Client: return
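
The happy path from the diagram could be sketched as below; the begin/run/commit handles are assumptions about the transaction API, not the PoC's actual interface. Error handling follows with the next diagram.

// Minimal stand-ins for a transactional database service (assumption).
interface Tx {
  run(query: string): Promise<number>
  commit(): Promise<void>
  rollback(): Promise<void>
}
interface TxService {
  begin(): Promise<Tx>
}

async function write(edge: TxService, root: TxService, query: string): Promise<number> {
  const edgeTx = await edge.begin()
  const rootTx = await root.begin()
  await edgeTx.run(query)                  // keep the edge up to date
  const affected = await rootTx.run(query) // the root enforces consistency
  await rootTx.commit()                    // root is the source of truth: commit it first
  await edgeTx.commit()                    // only then make the change visible on the edge
  return affected
}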

In case of any error on the root database, all transactions are rolled back. In case of an error in the edge database, it is possible to abort the transaction and reconstruct the edge database from the root database state.

sequenceDiagram
participant Client
participant CAP
participant SQLite
participant Postgres

Client->>CAP: Books(201)
CAP->>SQLite: UPDATE Books ID=201
CAP->>Postgres: UPDATE Books ID=201
SQLite-->>CAP: returns 1
Postgres-->>CAP: returns 1
CAP->>Postgres: COMMIT
Postgres-->>CAP: error
CAP->>Postgres: ROLLBACK
CAP->>SQLite: ROLLBACK
SQLite-->>CAP: success
Postgres-->>CAP: success
CAP-->>Client: return
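
Wrapping the same sequence in error handling (again only a sketch, mirroring the diagram's failure at the root commit) gives:

async function safeWrite(edge: TxService, root: TxService, query: string): Promise<number> {
  const edgeTx = await edge.begin()
  const rootTx = await root.begin()
  try {
    await edgeTx.run(query)
    const affected = await rootTx.run(query)
    await rootTx.commit()
    await edgeTx.commit()
    return affected
  } catch (err) {
    // Any error rolls back both sides. An error inside the edge database
    // could additionally trigger rebuilding it from the root state.
    await rootTx.rollback()
    await edgeTx.rollback()
    throw err
  }
}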

Data life cycle

There is the general concept of data life cycle definitions, which allow a system to move data through different stages. Generally the stages are called hot, warm and cold.

hot

Data is hot when it is regularly read or modified. It is important to keep hot data easily accessible to maintain fast query processing. In the case of an edge database it might not be feasible to keep all hot data on the edge, as the life span of hot data is generally on the scale of months, which becomes a problem when a million rows are generated per day. Therefore an additional edge life cycle stage would identify all the data that will be read within a daily time frame.

Storages: HANA, SQLite, Postgres

warm

With modern databases relying more on in-memory stores, it has become possible to define warm data, which mostly means data that will still be read on a weekly to monthly basis. The primary method of storing warm data is on disk, reducing storage costs in exchange for increased processing time when the data is read.

Storages: HANA NSE, SQLite disk, Postgres

cold

Most applications won't have cold data. It mostly applies to applications that need to comply with auditing laws (e.g. financial, medical or legal applications). The most common example is the quarterly or yearly financial report, which is based upon large datasets that are written once and then read only once after several months.

Storages: Spark, Hadoop, CSV

Data life cycle management

The most common method of defining the life cycle of data is to use Date/Time columns.

  • edge: modifiedAt > add_day($now, -1) -> modified within the last day
  • hot : createdAt > add_month($now, -1) -> created within the last month
  • warm: createdAt > add_year($now, -1) -> created within the last year
  • cold: createdAt < add_year($now, -1) -> created before the last year

This definition will keep recently updated data in the edge, while gradually moving data down as time goes on. To move the data, a regularly scheduled task delegates to the databases to move the relevant data out. For the edge database this task might run every hour, while for the lower layers it might run on a weekly or monthly basis. Usually the life cycle is linked to partitioning, as this allows the database to optimize query execution. The same optimization can be applied to the edge service.
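
For the edge tier this scheduled task can stay very simple: because every write already went through to the root, moving data out of the edge is just a delete over the expired window. A sketch, where the service handle and the table name are assumptions and add_day is the same date function used in the definitions above:

// Minimal stand-in for the edge database handle (assumption).
interface EdgeDb {
  run(statement: string): Promise<void>
}

async function evictExpiredEdgeData(edge: EdgeDb): Promise<void> {
  // The root already holds every row via write-through,
  // so eviction from the edge is a plain delete.
  await edge.run('DELETE FROM Notifications WHERE modifiedAt <= add_day($now, -1)')
}

// Run every hour, matching the edge life cycle granularity:
// setInterval(() => evictExpiredEdgeData(edgeDb), 60 * 60 * 1000)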

For example the following queries only need to be executed in a single database:

SELECT * FROM Notifications WHERE modifiedAt > add_hour($now, -1); -- all data is in the edge
SELECT * FROM Billings WHERE createdAt < add_month($now, -1); -- all data is in warm storage
SELECT * FROM Billings WHERE createdAt < add_year($now, -2); -- all data is in cold storage
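
A sketch of that routing decision, with each life cycle definition reduced to a retention window in days. Deriving the query's time window from its WHERE clause is glossed over here, and only queries bounded to recent data are shown; queries bounded to old data only (like the warm and cold examples above) would route to a single lower tier analogously.

interface Tier {
  name: string
  retentionDays: number // how far back this tier's data reaches
}

const tiers: Tier[] = [
  { name: 'edge', retentionDays: 1 },   // modifiedAt > add_day($now, -1)
  { name: 'hot',  retentionDays: 30 },  // createdAt > add_month($now, -1)
  { name: 'warm', retentionDays: 365 }, // createdAt > add_year($now, -1)
]

// The tiers are nested (warm contains hot contains edge), so the first
// tier whose window covers the query can answer it alone. Queries that
// reach past the warm window would also need the cold store.
function pickTier(queryWindowDays: number): Tier | undefined {
  return tiers.find(t => t.retentionDays >= queryWindowDays)
}

// pickTier(1 / 24)?.name // 'edge', e.g. modifiedAt > add_hour($now, -1)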

High availability

If there is an issue with the root database, it is possible to switch the service into read-only mode and serve requests from the edge database only, until the root database is available again and the service can switch back to normal operations.
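
A rough sketch of that read-only fallback, with the health probe and the flag as stand-ins for whatever mechanism the service would actually use:

// Hypothetical health probe for the root database (assumption).
declare function rootIsHealthy(): Promise<boolean>

let readOnly = false

// Flip into read-only mode while the root is down, back once it recovers.
setInterval(async () => {
  readOnly = !(await rootIsHealthy())
}, 10_000)

function assertWritable(isWrite: boolean): void {
  if (readOnly && isWrite) {
    // Writes cannot be accepted without the root database.
    throw new Error('503 Service Unavailable: read-only mode')
  }
  // While read-only, reads are served from the edge database only.
}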

If the edge database is configured properly and cannot de-synchronize from the root database, it could continue as the root of its designated data slice and later synchronize down to the root database once it becomes available again. As this requires extraordinary confidence in the configuration of the edge database, it is not recommended.

BobdenOs · Aug 08 '23 23:08