
Managing Identity Columns as Primary Keys in DMS. Feasible?

Mimetis opened this issue 2 years ago • 12 comments

Introduction

This issue is an explanation of "why" auto increment primary keys are not handled by DMS, yet. Some solutions exist to ensure a correct sync between several systems using identity primary keys for some tables. Theoretically, solutions exist, but in practice (you know.. real life) some really complex challenges can emerge from these solutions.

This post is NOT a solution, but more an open discussion on "How it could be done, if possible".

The challenge with auto increment keys is basically how we can handle a conflict when 2 records from 2 systems use the same (auto incremented) primary key for a new record. Deletes and updates are not a problem here; the main source of concern is the insertion.

  • Client inserts a new record: primary key generated: 42
  • Server inserts a new record: primary key generated: 42

Is it an update conflict? Is it two insertions? If so, should we change the client's primary key to 43? (And what will happen on the next insertion?)

Solutions

Basically, we have two solutions to avoid insertion conflicts with auto increment primary keys:

1) Assigning ranges

The idea is to work with identity columns as primary keys plus ranges of ids. For instance, you can define something like this:

  • Server IDS range: 1 - 99 999 999
  • Client 1 IDS range: 100 000 000 - 100 999 999
  • Client 2 IDS range: 101 000 000 - 101 999 999
  • Client 3 IDS range: 102 000 000 - 102 999 999

You can store these values in any configuration table.
When you reach the limit for one client (or the server), just assign a new range.

  • PROS : Works with SQL Server, MySQL and SQLite
  • CONS :
    • It is not thread safe (or more accurately, not SQL-session safe)
    • We need to manipulate the seeding of identity columns on each sync (see the sketch below)
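For instance, on SQL Server, the reseeding step could look like the sketch below (the [Customer] table and the range values are just examples taken from the list above; DBCC CHECKIDENT with RESEED sets the current identity value for a table):

-- Move Client 2's identity counter to the start of its assigned range
DBCC CHECKIDENT ('Customer', RESEED, 101000000);
-- With the default increment of 1, the next generated value will be
-- 101 000 001 (or 101 000 000 on a table that has never held a row),
-- so new inserts stay inside the 101 000 000 - 101 999 999 range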

2) Assigning Seed / Step

Following the same concept, another technique is to use the IDENTITY (m, n) formula as follows:

  • m = Serial number of a node (Server = 1, First Client = 2, Second Client = 3, etc.)
  • n = Number of replication nodes (Server + All Clients)

For example, say we have 4 nodes (n = 4) to sync (1 server + 3 clients): the serial number m will be 1 for the server, and 2, 3, 4 for Client A, Client B and Client C respectively.

According to the formula, the data will be populated as follows (a minimal table definition is sketched after the list):

  • Server : IDENTITY (1, 4) - the column data sequence will be 1, 5, 9, 13, ...
  • Client A : IDENTITY (2, 4) - the column data sequence will be 2, 6, 10, 14, ...
  • Client B : IDENTITY (3, 4) - the column data sequence will be 3, 7, 11, 15, ...
  • Client C : IDENTITY (4, 4) - the column data sequence will be 4, 8, 12, 16, ...
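As a sketch, on SQL Server the table definitions would only differ by their seed (the [Product] table is a hypothetical example):

-- On the Server (seed 1, step 4):
CREATE TABLE [Product] (
    [ProductID] int IDENTITY(1, 4) NOT NULL PRIMARY KEY,
    [Name] nvarchar(50) NOT NULL
);
-- On Client A the same table would be created with IDENTITY(2, 4),
-- on Client B with IDENTITY(3, 4), and so on.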

You can store the formula values / increments in any configuration table.

  • PROS : Easy to implement if you already know how many clients you will have
  • CONS :
    • You need to know the MAX number of clients you will have in your organization.
    • It does not work with SQLite

Solution we will discuss here

Since the most common client database used with DMS is SQLite, and since SQL Server merge replication also uses this technique, we will discuss the implementation of Solution 1: Assigning Ranges.

DMS possible Implementation

I've already worked on the auto increment primary key implementation in DMS, even if it hasn't been published so far. I'm going to explain what I've done and which problems I have not resolved yet.

Playing with Identity Columns On SQL Server

Firstly, let's see what we have, in terms of queries / language / hints, on SQL Server, when working with identity columns (obviously, there are many other SQL statements for IDENTITY columns that we don't need here; a quick reference follows the list):

  • SET IDENTITY_INSERT : Allows the insertion of an explicit value into an identity column
  • DBCC CHECKIDENT : Manages the seed and step of an identity column for a specific table
  • IDENT_CURRENT : Returns the last identity value generated for a specific table
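As a quick reference, here is how the three of them are used (on a hypothetical [Product] table):

SET IDENTITY_INSERT [Product] ON;         -- allow explicit values in the identity column
SELECT IDENT_CURRENT('Product');          -- last identity value generated for [Product]
DBCC CHECKIDENT ('Product', RESEED, 42);  -- force the current identity value to 42
SET IDENTITY_INSERT [Product] OFF;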

Macro implementation in DMS

For the implementation, we will stick to SQL Server only, for now.

scope_info_identity table

Firstly, we need to save and handle the ranges for each client and the server. Obviously, we can use a simple table (like scope_info) to store these values. Nothing really complicated here. Here is a draft of the scope_info_identity table, stored and managed on the server side:

sync_scope_id     scope_table_name  scope_identity_min  scope_identity_max  scope_threshold
Client_ID_GUID_1  Customer          1 000 000           1 999 999           0.7
Client_ID_GUID_2  Customer          2 000 000           2 999 999           0.7
Client_ID_GUID_1  Product           1 000 000           1 999 999           0.7
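As T-SQL, this draft could look like the following (the column types are assumptions, not a final design):

CREATE TABLE [scope_info_identity] (
    [sync_scope_id]      uniqueidentifier NOT NULL,
    [scope_table_name]   nvarchar(350)    NOT NULL,
    [scope_identity_min] bigint           NOT NULL,
    [scope_identity_max] bigint           NOT NULL,
    [scope_threshold]    float            NOT NULL,
    CONSTRAINT [PK_scope_info_identity]
        PRIMARY KEY ([sync_scope_id], [scope_table_name])
);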

Using this kind of table, we can imagine a sync system where a client will (a sketch of the server-side check follows the list):

  • Send its last identity value for each table
  • The server checks whether that value is above the threshold or not
  • If the threshold is exceeded, the server sends back a new range of ids for this specific table and this specific client
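A hypothetical version of that server-side check, assuming the scope_info_identity draft above (the three variables at the top would be sent by the client; the values here are samples):

DECLARE @sync_scope_id uniqueidentifier = NEWID();
DECLARE @table_name nvarchar(350) = 'Customer';
DECLARE @last_client_identity bigint = 1700000;

DECLARE @min bigint, @max bigint, @threshold float;
SELECT @min = [scope_identity_min],
       @max = [scope_identity_max],
       @threshold = [scope_threshold]
FROM [scope_info_identity]
WHERE [sync_scope_id] = @sync_scope_id
  AND [scope_table_name] = @table_name;

-- With a 0.7 threshold, a new range is needed once the client has
-- consumed 70% of its current one
IF (@last_client_identity - @min) >= @threshold * (@max - @min)
BEGIN
    -- Send back a new range for this client / table; the allocation
    -- strategy itself is left open in this discussion
    PRINT 'Threshold exceeded: allocate a new range';
END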

Merging rows

Here is where the problems begin. Below is a macro version of the actual Insert / Update stored procedure (on SQL Server, we are using a MERGE statement) used when inserting new rows on any client or server (when dealing with identity primary keys):


SET IDENTITY_INSERT [Product] ON;

MERGE [Product] AS [base]
USING [changes] ON [changes].[ProductID] = [base].[ProductID]
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...;

SET IDENTITY_INSERT [Product] OFF;

Something important to notice about SET IDENTITY_INSERT: it allows us to specify an explicit value for our identity column, and if this value is higher than the current identity value, the internal identity counter is bumped up to it as well.

A small sample:

SELECT IDENT_CURRENT('Product')
-- Result : 1

INSERT INTO Product (Name) VALUES ('Bikes')
SELECT IDENT_CURRENT('Product')
-- Result : 2

SET IDENTITY_INSERT [Product] ON;
INSERT INTO Product (ProductID, Name) VALUES (1000000, 'Bikes')
SET IDENTITY_INSERT [Product] OFF;
SELECT IDENT_CURRENT('Product')
-- Result : 1 000 000

Since we need to be sure each client stays within its range, we need to handle this behavior.

Client 1: Range 1 000 000 to 1 999 999

Let's see what happens for Client 1, whose range is 1 000 000 to 1 999 999. We need to ensure that all values inserted from Client 1 always stay in that range.

But what happens if we run a sync on Client 1 with records coming from Client 3, whose range is 3 000 000 to 3 999 999? Well, for sure, the internal identity value for that column will be reseeded to something between 3 000 000 and 3 999 999. And we need to avoid that.

Here is a macro version that lets us avoid this situation:

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE [Product] AS [base]
USING [changes] ON [changes].[ProductID] = [base].[ProductID]
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...;

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current);

Using this technique, we RESEED the internal increment value back to the value saved before the merge (which may have introduced higher values into the identity column).

This solution leads to several problems...

Inserts from multiple SQL sessions

What happens if another session inserts a row between the saving of the identity value (SELECT @identity_current = IDENT_CURRENT('Product');) and the RESEED?

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

-- ANOTHER SESSION INSERTS A RECORD HERE IN THE CLIENT 1 DATABASE

MERGE ...

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

The other session inserts a new record, so our saved value is no longer correct. The RESEED will still execute, but it will set the counter back below an identity value that is already in use. Since the reseeding is incorrect, the next INSERT in the Client 1 database will raise an error (a primary key violation).

On the other hand, what happens if another session inserts something just after the MERGE?

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE ...

-- ANOTHER SESSION INSERTS A RECORD HERE IN THE CLIENT 1 DATABASE

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

Here, we may get a new record in the Client 1 database that is outside the authorized range of values...

Using a serializable transaction

A serializable transaction is the most protective transaction isolation level. Can we use it here?

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE ...

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

The serializable transaction won't save us here, since it does not lock the entire table and still allows any other session to get a new identity value and insert a new record into the table, exactly as we saw in our last two samples...
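The underlying reason (easy to verify) is that identity allocation happens outside of transaction scope: a rolled-back insert still consumes a value, so no isolation level can stop another session from moving the counter.

BEGIN TRANSACTION;
INSERT INTO [Product] ([Name]) VALUES ('Bikes');
ROLLBACK TRANSACTION;

SELECT IDENT_CURRENT('Product');
-- The result still reflects the rolled-back insert: the identity counter
-- moved forward even though the row was never committed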

TABLOCK hint

The TABLOCK hint is more restrictive and locks the entire table during our MERGE statement:

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE [Product] WITH (TABLOCK) ...

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

It will not save us either, since other sessions can still insert before or after the MERGE statement...

Solutions on both sides

So far, we have seen that the problem comes from multiple SQL sessions hitting the table while it is being updated by a sync MERGE statement...

Freeze the UI and prevent multiple SQL sessions

Can we say something like: "Hey, while a sync is running on this table, DO NOT TRY TO MAKE ANY INSERTS INTO IT"?

That could potentially work for client databases. You can easily freeze the UI and prevent this situation from happening.

But what happens on the server side? The server needs to handle multiple syncs at the same time, and we can't prevent two syncs from happening simultaneously...

Disabling the internal counter increment

An ideal solution would be to disable the internal increment of the identity current value during the MERGE statement:

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);

-- DISABLE INTERNAL IDENTITY CURRENT INCREMENT HERE, BUT ONLY FOR THIS SESSION
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE [Product] WITH (TABLOCK) ...

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

I guess this is what happens when using the NOT FOR REPLICATION option with SQL Server merge replication.

The "Not For Replication" setting for identity columns allows replication to pass the ID value from the publisher to the subscriber without the subscriber creating a new ID. Since identity columns will always generate a new number when records are inserted, this setting allows you to ensure the values on both your publisher and subscribers stay in synch.

Source : https://www.mssqltips.com/sqlservertip/1274/change-not-for-replication-value-for-sql-server-identity-columns/
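For reference, here is the standard T-SQL syntax for that option, on a hypothetical [Product] table:

CREATE TABLE [Product] (
    [ProductID] int IDENTITY(1, 1) NOT FOR REPLICATION NOT NULL PRIMARY KEY,
    [Name] nvarchar(50) NOT NULL
);

-- An existing identity column can be switched too (see the tip above):
-- ALTER TABLE [Product] ALTER COLUMN [ProductID] ADD NOT FOR REPLICATION;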

Conclusion

I'm not sure I've put here all the research and tests I've made so far on the sync identity column problem, but at least this thread can be a good starting point to discuss it.

If you have any ideas, do not hesitate to share your thoughts here.

Mimetis avatar Aug 17 '21 14:08 Mimetis

To be honest - I do consider using sequential numeric values as primary keys in sync scenarios a fundamental design flaw. No one should ever do this. Rather, use sequential guids or the like to improve index clustering. So I am going to stay out of this discussion ;-)

gentledepp avatar Aug 22 '21 14:08 gentledepp

Speaking as one who is staying out of this discussion to another @gentledepp, how are sequential guids generated on the client?

I'm not sure sequential numeric values are necessarily a design flaw in a distributed system. Like most things, it depends. Most systems, including most systems discussed here do not generate more than 10 million rows per second. For those, DateTimeOffset.UtcTicks may be a viable primary key.

Or that's what I would have said if I weren't staying out of it.

VagueGit avatar Aug 23 '21 00:08 VagueGit

May I ask what the current status of this is? I am using a different sync framework at the moment, but I am thinking about switching.

In our database we use integer-based keys and foreign keys. And before everybody mentions we should use GUIDs: that is not possible for us. We need to communicate with other platforms/applications that don't allow GUIDs as id/foreign keys. I can add GUIDs as primary keys, but I would still need an extra unique integer-based id field.

The current framework also works with id ranges per client, and those are updated on each sync. This setup was also mentioned above, but I was wondering whether it is enabled in the current release?

percramer avatar Aug 27 '22 22:08 percramer

Hello @percramer

Today, using identity columns as primary keys works fine in DMS for Update and Delete statements, but you will get conflicts on Insert if the server and a client insert a new record in the same range.

Since we do not yet have a proper solution for handling identity columns on the server side, it will not be implemented. You can obviously implement the logic on your own, but you will have to make some trade-offs, as explained in this post.

Just out of curiosity, which framework are you using today, and does it manage identity columns correctly, or do you have to make trade-offs?

Mimetis avatar Aug 28 '22 08:08 Mimetis

Hi @Mimetis,

I am using Sqlite-sync. They made the express version open source; it only works with SQLite on both the server and client side. I am using SQL Server on the server side, but it all works the same way. I need to be able to do inserts from the client side, since the application is also used in locations where there is no network connection.

What they do is also reserve id ranges per client (and per table, of course) in a table, but those are synced to the client as well. So each client has a unique integer range which is updated (if needed) during the sync. And triggers on the client SQLite tables ensure that the used values come from that range. So the server doesn't have to deal with that anymore, because the data from the clients already has unique ids (see the sketch below).
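For illustration, such a client-side trigger could look roughly like this in SQLite (the Customer table and the one-row client_range(min_id, max_id) config table are hypothetical names, not sqlite-sync's actual schema):

CREATE TRIGGER enforce_customer_range
BEFORE INSERT ON Customer
WHEN NEW.CustomerID NOT BETWEEN (SELECT min_id FROM client_range)
                            AND (SELECT max_id FROM client_range)
BEGIN
    -- Abort any insert whose id falls outside the assigned range
    SELECT RAISE(ABORT, 'CustomerID outside of the assigned identity range');
END;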

So that is partially what you proposed as well, but you wanted to keep that logic server side? Wouldn't my current framework's setup be a workable solution for this too? It works without any problems at the moment.

Sadly though, sqlite-sync has some other design flaws that prevent usage with larger amounts of data / changes.

percramer avatar Aug 28 '22 09:08 percramer

In our database we use integer-based keys and foreign keys. And before everybody mentions we should use GUIDs: that is not possible for us. We need to communicate with other platforms/applications that don't allow GUIDs as id/foreign keys. I can add GUIDs as primary keys, but I would still need an extra unique integer-based id field.

A possible solution could be to add a GUID column while keeping the identity column as a non-primary-key indexed column. You can still index the identity column for performance, but the GUID column would be used to confirm that rows are actually referencing the same data row. This could be done in conjunction with cascading updates in SQL Server*** to change the identity of a potential collision before syncing. I see this as "doable" but certainly not elegant or easy to implement. It would be very heavy on the user to make sure all the cascading updates and new GUID columns are set up properly, and I have doubts that it could be made into a general solution usable by average users.
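A rough sketch of that layout (all names are illustrative): the GUID becomes the primary / sync key, while the identity stays as a unique, indexed, non-primary column for the integer-only consumers.

CREATE TABLE [Customer] (
    [CustomerGuid] uniqueidentifier NOT NULL
        CONSTRAINT [DF_Customer_Guid] DEFAULT NEWSEQUENTIALID()
        CONSTRAINT [PK_Customer] PRIMARY KEY,
    [CustomerId]   int IDENTITY(1, 1) NOT NULL,  -- kept for external systems
    [Name]         nvarchar(100) NOT NULL
);
-- Keep the identity unique and fast to look up, without making it the key
CREATE UNIQUE INDEX [IX_Customer_CustomerId] ON [Customer] ([CustomerId]);

NEWSEQUENTIALID() gives the sequential GUIDs mentioned earlier in the thread, which keeps the clustered index from fragmenting.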

In summary: Feasibility of providing possible solution pathways to people interested and experienced enough to implement them for their own organization: quite feasible. Feasibility of creating an elegant, easy-to-implement general solution that could handle all of the possible individual requirements of all systems: not feasible.

*** I'm expert-level in MS SQL Server, but I won't begin to claim I know what fancy features the other supported DBs do or don't support.

SvdSinner avatar Feb 28 '23 16:02 SvdSinner

I've been using Microsoft Sync Framework since 2011, serving more than 6 billion transaction records on one Azure Hyperscale database synced with 6000+ clients.

Sadly, when the DB was designed 12 years ago, we used integer (not even bigint) for the identity (auto increment) column. The way we handle conflicts follows Microsoft's best practice: configure the sync group so that, when a conflict happens, either the client data or the server data is used as the final copy.

We have set it up so that, when a conflict happens, the sync agent uses the client side's data to override the server side's data.

The key solution is to use composite columns. E.g. a retail group has 200+ shops, each with an (int) column for OrderId, so:

  • Shop A - Order Id 1001
  • Shop B - Order Id 1001
  • Shop C - Order Id 1001
  • Shop D - Order Id 1001

In order to prevent conflicts, we used composite keys as the primary key with auto increment:

  • Shop A - Shop Id 123, Order Id 1001 (the transaction ref number will be 123-1001)
  • Shop B - Shop Id 124, Order Id 1001
  • Shop C - Shop Id 125, Order Id 1001
  • Shop D - Shop Id 126, Order Id 1001

And we even introduced the terminal Id (as one shop may have more than one terminal) into the composite keys (a minimal sketch follows the list):

  • Shop A - Shop Id 123, Till Id 1, Order Id 1001 (the transaction ref number will be 123-1-1001)
  • Shop A - Shop Id 123, Till Id 2, Order Id 1001
  • Shop B - Shop Id 124, Till Id 1, Order Id 1001
  • Shop C - Shop Id 125, Till Id 1, Order Id 1001
  • Shop D - Shop Id 126, Till Id 1, Order Id 1001
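A minimal T-SQL sketch of that composite-key layout (table and column names are illustrative):

CREATE TABLE [ShopOrder] (
    [ShopId]  int NOT NULL,                 -- fixed per shop (123, 124, ...)
    [TillId]  int NOT NULL,                 -- fixed per terminal within the shop
    [OrderId] int IDENTITY(1, 1) NOT NULL,  -- auto increment, local to each database
    CONSTRAINT [PK_ShopOrder] PRIMARY KEY ([ShopId], [TillId], [OrderId])
);
-- Every shop / till keeps its own OrderId sequence; the (ShopId, TillId, OrderId)
-- triple stays globally unique, so inserts never collide across nodes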

In summary: I think DMS does not need to handle "creation" conflicts on identity columns used as primary keys; that is a design flaw of the SQL DB schema. The current conflict resolution (either client overrides server, or server overrides client) is already sufficient to handle this situation.

bearyung avatar Apr 19 '23 02:04 bearyung

Although that is a nice solution to the issue (composite keys), it can't always be applied. We do have situations like yours, but with a few systems we don't have control over the db schema (since they come from external software applications which have been bought). And these applications sadly only work with a single-column integer key.

percramer avatar Apr 19 '23 08:04 percramer

Although that is a nice solution to the issue (composite keys), it can't always be applied. We do have situations like yours, but with a few systems we don't have control over the db schema (since they come from external software applications which have been bought). And these applications sadly only work with a single-column integer key.

Do you have control over the local database? If yes, I am thinking you could create a new column (e.g. ShopId) on those local tables, and set up a stored procedure or a row trigger to automatically fill in the ShopId: 1 for Shop A, 2 for Shop B, and so on.

And then in DMS, set a filter to bi-directionally sync only the rows whose ShopId column equals 1 to Shop A, 2 to Shop B, etc.

bearyung avatar Apr 19 '23 09:04 bearyung

No, that is the main issue: we can't change the schema at all. Our SQL user doesn't have those rights. We were using Sqlite-sync in the past, but for various reasons I am looking for something different. SQLite-sync uses an identity range per client which is actively updated when the user syncs, with local triggers to ensure the id is in the user's range. I have been thinking about implementing this here also, but time...

percramer avatar Apr 19 '23 09:04 percramer

No, that is the main issue: we can't change the schema at all. Our SQL user doesn't have those rights. We were using Sqlite-sync in the past, but for various reasons I am looking for something different. SQLite-sync uses an identity range per client which is actively updated when the user syncs, with local triggers to ensure the id is in the user's range. I have been thinking about implementing this here also, but time...

For the identity-range-per-client approach, the sync server needs to estimate the growth rate of each client, how frequently the client will connect to the server to update the "used range" and take the next batch of the available range, and, in case a range uses up all its ids, whether it can overflow (the overflow part would potentially conflict with another client's assigned range).

Another approach is to build a middle DB (either local or in the cloud), use DMS to perform a 1:1 sync of the SQLite data to the middle DB, then fill in the shop Id in the middle DB, and perform another data sync from the middle DB to the centralized DB (merged with the other shops' data).

bearyung avatar Apr 19 '23 10:04 bearyung

I have been doing some testing with negative primary keys in my client app, so the auto increment step is -1 instead of 1.

And then during the sync I can easily check whether they are negative and replace them with a newly generated value; I just need to sync that back to the client then (see the sketch below).
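On SQL Server, the client side of that idea can be expressed directly with a negative seed and step (names are illustrative; on SQLite it would need a trigger or manual assignment instead):

CREATE TABLE [LocalOrder] (
    [OrderId] int IDENTITY(-1, -1) NOT NULL PRIMARY KEY,  -- generates -1, -2, -3, ...
    [Label]   nvarchar(50) NOT NULL
);
-- Negative ids mark rows that have not been synced yet; during the sync they
-- are swapped for server-generated positive keys and pushed back to the client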

percramer avatar Feb 21 '24 14:02 percramer