flow icon indicating copy to clipboard operation
flow copied to clipboard

Database Synchronization Tool

Open norberttech opened this issue 10 months ago • 0 comments

One of the challenges of working with large systems is state reproducibility. In order to fix any bug, test something manually, or simply play with the system locally, we need data.

At the beginning, using fixtures might be just enough, but with the growth, adding fixtures is becoming problematic. Additionally, fixtures are not particularly useful when it comes to reproducing bugs.

There are different ways to approach those challenges, for example:

  • staging environment
  • db dumps
  • manual data sync

There is also another solution, automated selective data sync through ETL. There are few ways of synchronizing db tables, each of them should be pretty much an implementation of SynchronizationStrategy interface.

Limit Offset

This is the easiest one, below list of parameters it should take:

  • string $table - name of db table (external and local should be the same and table must exists)
  • array<SortBy> $sortBy - list of columns used to sort the results by, at least one is mandatory
  • ?int $maxRows - maximum number of rows to sync
  • int $readBatchSize = 1000 - how many rows to read at once from an external table
  • int $loadBatchSize = 1000 - how many rows to write at once to a local table

Alternatively instead of $table and $sortBy we can also allow to pass instance of QueryBuilder for more advanced use cases.

Time Range

  • string $table - name of db table (external and local should be the same and table must exists)
  • array<SortBy> $sortBy - list of columns used to sort the results by, at least one is mandatory
  • string $dateColumn - name of the column used to filter the results by date
  • \DatePeriod $range - time range to sync
  • ?int $maxRows - maximum number of rows to sync
  • int $loadBatchSize = 1000 - how many rows to write at once to a local table

The idea is simple, let say we want to sync last month of data with 1d interval. What we need to do is to create standalone query for each day, and then run it in a loop.

Alternatively instead of $table and $sortBy we can also allow to pass instance of QueryBuilder for more advanced use cases.

Related Table

First two strategies are pretty straightforward, time range is a bit more complex but it performs much better than limit/offset on large tables. The problem starts when we need to sync a table that is related to another table. Example:

  • Orders
  • Order Line Items

Since we don't know upfront how many order items are in each order we can't use any of the above strategies.

Instead, we can read the identifier from a local version of Orders (previously synced with Limit/Offset) and then use it to filter the results from the external table.

  • string $targetTable - name of db table (external and local should be the same and table must exists)
  • string $sourceTable - name of db table (external and local should be the same and table must exists)
  • array<string> $targetIdentifiers - list of columns used to filter the results by, at least one is mandatory
  • array<string> $sourceIdentifiers - list of columns used to filter the results by, at least one is mandatory

Sanitization

Additionally, we should add SanitizeStrategy interface to sanitize the data before writing it to a local table. The implementation should simply take one of other strategies and list of columns that must be sanitized.

  • SynchronizationStrategy $strategy - instance of SynchronizationStrategy
  • array<string> $columns - list of columns to sanitize
  • string $mask = '*****' - mask to use for sanitization

From a technical point of view, SynchronizationStrategies should be simply used to create a DataFrame so the interface can look like this:

interface SynchronizationStrategy
{
    public function toDataFrame(): DataFrame;
}

On top of that, we will need something that will execute those synchronization strategies (synchronously, one after another) which I think can be a part of Flow CLI tool.

Flow CLI already comes with a Doctrine DBAL dependency and Flow Dbal Adapter.

Ideally we should point the CLI to the php file that would return a list of SynchronizationStrategies to execute. Or maybe even to a Synchronization object that would return a DataFrame for each strategy that would be executed in the command.

norberttech avatar Mar 09 '25 21:03 norberttech