updated_at should be used in conjunction with syncTime
To avoid race conditions, try using `updated_at` in conjunction with `currentSyncStartTime`.
Here, `currentSyncStartTime` refers to the timestamp at which the current sync started.
For example:

```
round 1:
  lastSyncStartTime    = 12:02
  currentSyncStartTime = 12:04
  SELECT * FROM users WHERE updated_at >= lastSyncStartTime AND updated_at < currentSyncStartTime

round 2:
  lastSyncStartTime    = currentSyncStartTime   -- 12:04
  currentSyncStartTime = 12:06
  SELECT * FROM users WHERE updated_at >= lastSyncStartTime AND updated_at < currentSyncStartTime

round 3:
  lastSyncStartTime    = currentSyncStartTime   -- 12:06
  currentSyncStartTime = 12:15
  SELECT * FROM users WHERE updated_at >= lastSyncStartTime AND updated_at < currentSyncStartTime
```
A few points worth noting:

- The sync can take as long as required (e.g. using paging for a large batch of records) without any problem.
- While the sync is happening, updates can happen in parallel, and the outcome is not affected by race conditions. All updates will eventually be synced without any loss.
- If enough care is taken on the destination side (to ensure old updates do not overwrite later ones), it is even possible to run multiple syncs in parallel safely (for example, start a sync every 10 minutes on a timer, irrespective of whether the earlier sync has completed or not).
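The rounds above can be sketched as a small simulation. This is a toy in-memory model, not the repo's implementation; `syncRound` stands in for the real `SELECT ... WHERE updated_at >= $1 AND updated_at < $2` query, and the timestamps are just minutes past 12:00.

```typescript
type User = { id: number; updated_at: number };

// One round of the windowed sync: the half-open interval
// [lastSyncStartTime, currentSyncStartTime) means every update lands in
// exactly one round, even if updates keep arriving while a sync runs.
function syncRound(
  table: User[],
  lastSyncStartTime: number,
  currentSyncStartTime: number
): User[] {
  return table.filter(
    (u) => u.updated_at >= lastSyncStartTime && u.updated_at < currentSyncStartTime
  );
}

// Simulate the three rounds from the example above.
const users: User[] = [
  { id: 1, updated_at: 3 },  // falls in round 1: [2, 4)
  { id: 2, updated_at: 5 },  // falls in round 2: [4, 6)
  { id: 3, updated_at: 10 }, // falls in round 3: [6, 15)
];
const round1 = syncRound(users, 2, 4);  // picks up user 1
const round2 = syncRound(users, 4, 6);  // picks up user 2
const round3 = syncRound(users, 6, 15); // picks up user 3
```

The half-open interval is what makes the rounds non-overlapping: a row with `updated_at` exactly at a window boundary belongs to the next round, never to both.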
Thanks for contributing! This "windows" approach is a good one. It's approximately the strategy that the Airflow sync uses, if I recall correctly. There are still a few issues to be aware of, though.
For example, if `updatedAt` can change, you still have the offset problem I described in my blog post. If you don't do offsets, then you might have memory issues, so it gets tricky.
There is also the *now* concept at play here. Grouparoo, for example, is often syncing with data warehouses that are "behind" the sync server's *now*. So you have to use the warehouse's concept of *now*, which is the high-water mark I describe in the code and blog post.
What would work well is to divide what needs to be synced into time windows of reasonable size. Then, as you note, there is the advantage of doing those in parallel. The tricky part, of course, is in the dividing. I'd be curious to research how others do this, but I can't think of anything better than walking forward with a LIMIT query.
That would look very close to the steps or batch approaches, but instead of walking forward one "window" at a time, it would come up with the "windows" first and then do them in parallel. It would have some of the same duplication issues as they do with the `>=` situation, and could also be made smarter with the secondary-id knowledge. You could also get rid of most of the duplication issue by systematically doing the `>=` in smart ways when in the past.
I think some might do it a priori by always picking an N-minute window. This seems to have a risk of having many records in any given window (and then maybe the offset issue), but it would work, too. It would, however, be able to remove the duplication issue.
@KrishnaPG Are you interested in contributing a "windows" algorithm to this repo?
Thank you @bleonard
> but I can't think of anything better than walking forward with a LIMIT query.
Is LIMIT not a good approach? Paginating the sync window with a decent LIMIT seems fine to me, unless I am missing something.
LIMIT is fine, but you'd end up doing that N times. That is, N LIMIT 100 queries to get the batches. And then you'd do that again in each parallel bit.
For example, it might be better to do one query with COUNT(*) and GROUP BY DAY(updated_at) or something. But it's hard to know whether that should be HOUR, etc.
I wonder if there is some fancy GROUP concept that I don't know about that puts things in buckets. A little bit of googling suggests NTILE might be interesting: https://www.postgresql.org/docs/9.5/functions-window.html
You still might need to know the right interval.
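If NTILE did fit, the bucketing it performs looks roughly like this. This is only a toy in-memory sketch of the semantics (not a real query); PostgreSQL's `NTILE(n)` splits an ordered row set into `n` buckets as evenly as possible, with the larger buckets first.

```typescript
// NTILE-style bucketing: split rows (already ordered by updated_at)
// into n roughly equal buckets, larger buckets first, so each bucket
// could then be synced in parallel.
function ntileBuckets<T>(rows: T[], n: number): T[][] {
  const base = Math.floor(rows.length / n);
  const extra = rows.length % n; // the first `extra` buckets get one more row
  const buckets: T[][] = [];
  let i = 0;
  for (let b = 0; b < n; b++) {
    const size = base + (b < extra ? 1 : 0);
    buckets.push(rows.slice(i, i + size));
    i += size;
  }
  return buckets;
}

const timestamps = [1, 2, 3, 5, 8, 13, 21]; // ordered by updated_at
const buckets = ntileBuckets(timestamps, 3);
// → [[1, 2, 3], [5, 8], [13, 21]]
```

As noted above, though, NTILE gives equal row counts per bucket, not equal time intervals, so you'd still need to decide whether that is the split you actually want.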
Just brainstorming.
If the number of records to be synced is the problem, then there is no need for paging or doing N queries at all. Just pick a (configurable) number as your record-set size LIMIT and run the query every few seconds or minutes, just once per run. No need for paging, and no need to run more than one sync in parallel either. It will eventually catch up. For example:
```
const rows = select * from users where updated_at >= last_synced_record_update_time ORDER BY updated_at ASC LIMIT 1000
last_synced_record_update_time = rows[rows.length - 1].updated_at
```
This is especially helpful for fast-changing records (e.g. when the number of updates is high, such as on a highly popular eCommerce site), as it will only sync the latest values and will completely avoid syncing the in-between changes between sync runs.
One advantage of this method is that the sync timer can be tuned automatically.
If the returned row set is empty or smaller than the record-set size (e.g. 1000 in the above example), one could use exponential decay to reduce the frequency at which the sync runs. In other words, we adjust the sync-timer gap in relation to the speed of the updates. A less popular website will run fewer syncs (subject to a lower ceiling limit), while rapidly changing websites will run higher-frequency syncs (subject to a higher ceiling limit). All with:

- eventual consistency (for large numbers of records)
- duplicate elimination (for high-frequency updates), i.e. throttling
- no race conditions or data loss
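The single-cursor catch-up loop described above can be sketched against an in-memory "table". This is a toy model, not the repo's code; note it uses a strict `>` on the cursor to sidestep the infinite-loop case that `>=` invites when a batch ends exactly on the cursor timestamp, which assumes boundary timestamps are unique (the same-timestamp problem is discussed further down the thread).

```typescript
type Row = { id: number; updated_at: number };

// One catch-up pass: repeatedly pull the oldest pending batch and
// advance the cursor to the last row's updated_at, stopping once
// nothing is pending. The real query would be:
//   SELECT * FROM users WHERE updated_at > $cursor
//   ORDER BY updated_at ASC LIMIT $batchSize
function catchUp(table: Row[], batchSize: number): Row[][] {
  const batches: Row[][] = [];
  let cursor = -Infinity;
  for (;;) {
    const rows = table
      .filter((r) => r.updated_at > cursor)
      .sort((a, b) => a.updated_at - b.updated_at)
      .slice(0, batchSize);
    if (rows.length === 0) break; // fully caught up
    batches.push(rows);
    cursor = rows[rows.length - 1].updated_at;
  }
  return batches;
}

const table: Row[] = [
  { id: 1, updated_at: 1 },
  { id: 2, updated_at: 2 },
  { id: 3, updated_at: 4 },
  { id: 4, updated_at: 9 },
  { id: 5, updated_at: 12 },
];
const batches = catchUp(table, 2);
// → three batches: [1, 2], [3, 4], [5]
```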
In fact, the record-set size (e.g. the 1000 in the above example) can also be dynamically tuned, if one is willing to add some complexity to the implementation. For example, the record-set size could be adjusted as a percentage of the total remaining records to be synced. The initial value can be some percentage of the total number of records in the DB. Something along the lines of:
```
const pendingRecordsCount = select count(*) from users where updated_at >= last_synced_record_update_time
const record_set_size = pendingRecordsCount < 1000 ? pendingRecordsCount : min(0.25 * pendingRecordsCount, 2000)
const rows = select * from users where updated_at >= last_synced_record_update_time ORDER BY updated_at ASC LIMIT record_set_size
last_synced_record_update_time = rows[rows.length - 1].updated_at
```
The auto-tuning can be something along the lines of:

```
if (record_set_size < 1000) {
  // slow down the sync: we have already caught up with all records
  add_few_more_seconds_delay_to_the_next_sync();
} else if (record_set_size == 2000) {
  // We have more records to catch up on, so speed up a little bit. But not too much:
  // on a high-update server, the more we wait, the less sync work we have to do,
  // since waiting avoids syncing the in-between duplicates. In other words, throttling.
  schedule_next_sync_few_seconds_earlier();
}
```
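The tuning rule above can be made concrete as a pure function of the last batch size and the current delay. All the thresholds here (the 1000/2000 watermarks, the doubling/halving steps, the floor and ceiling) are illustrative values, not prescriptions from the thread.

```typescript
// Exponential back-off / speed-up for the sync timer: caught up -> back
// off toward a ceiling; falling behind -> speed up toward a floor.
function nextSyncDelayMs(
  lastBatchSize: number,
  currentDelayMs: number,
  opts = { minMs: 1_000, maxMs: 60_000, lowWater: 1000, highWater: 2000 }
): number {
  if (lastBatchSize < opts.lowWater) {
    // Caught up with all records: back off exponentially, up to a ceiling.
    return Math.min(currentDelayMs * 2, opts.maxMs);
  }
  if (lastBatchSize >= opts.highWater) {
    // More records pending: speed up a little, down to a floor.
    return Math.max(Math.floor(currentDelayMs / 2), opts.minMs);
  }
  return currentDelayMs; // in the comfortable middle: leave the timer alone
}

// A quiet site drifts toward the ceiling...
let quiet = 4_000;
quiet = nextSyncDelayMs(10, quiet); // 8000
quiet = nextSyncDelayMs(10, quiet); // 16000

// ...while a busy site drifts toward the floor.
let busy = 16_000;
busy = nextSyncDelayMs(2500, busy); // 8000
```

Keeping this as a pure function makes the policy trivially testable and keeps the scheduling mechanism (timers, cron, etc.) separate from the tuning decision.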
Updates can be "overwrites" of previous records (update, delete, etc.) or additions of new records. The above approach works for both scenarios, but gives priority to the "overwrite" style of updates.
Yes, I believe your approach is more or less the steps or batch approach.
The only addition to what you said was to handle the case where there are more than batchSize records with the same timestamp. Each of those handles that case slightly differently. Without that, there is an infinite loop possibility.
Deletes are a notoriously trickier situation, and none of these algorithms really handle rows being removed per se.