RANGER-5205: Liquibase based Zero-Downtime-Upgrade framework for Ranger/KMS
What changes were proposed in this pull request?
A new framework using Liquibase to support ZDU (Zero-Downtime Upgrade). Design document and architecture: https://docs.google.com/document/d/1kDbrdN9n2yNzmkwwDd_Gx90jxN3VxTfTKvMKEJiplG0/edit?tab=t.0
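For illustration only (not taken from the patch itself), a Liquibase formatted-SQL changeset for an expand step might look like the sketch below; the table and column names are made up.

```sql
--liquibase formatted sql

--changeset ranger:zdu-expand-example
-- Hypothetical expand step: add a new column alongside the old one so that
-- old and new binaries can run against the same schema during the upgrade.
ALTER TABLE x_example ADD COLUMN col2 VARCHAR(255);
--rollback ALTER TABLE x_example DROP COLUMN col2;
```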
How was this patch tested?
Unit tests have been added. Tested in a private cluster. WIP: testing in Docker setup.
@fateh288, thanks for the implementation.
Following are review comments on the design part, not on the implementation.
As per one of the discussions with you, and my understanding, consider the following scenario:
There is a change in the format of the data in one particular column, say Table1.Col1. As per the current approach, during the schema upgrade a new column (say Col2) of the required type will be created and the older data will be copied into the new column after transformation (a minimal sketch of this sequence follows after the list below). Now we may have two scenarios:
- New binaries not yet applied, meaning read/write will continue to happen against the old column. How are you handling the case where a column is updated by the application after the data has been copied from Col1 to Col2? If we are using a trigger on each data modification that copies to the new column, then we have multiple trigger executions for the same column (if multiple modifications are happening), and these executions may also overlap with the job/cursor that is copying old-format data to the new column.
This scenario poses two risks:
- Trigger executions (due to multiple updates on the same column) may be processed in a different order, and older data may be written to the new column.
- What if some trigger processing/execution fails? We may log this and report this, but do we have any systematic way (out of the box from the framework itself) to detect this and retry?
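To make that concrete, here is a minimal sketch of the sequence as I understand it (MySQL-flavoured SQL; the table/column names and the UPPER() transformation are hypothetical placeholders):

```sql
-- 1. Expand: add the new column for the new data format.
ALTER TABLE table1 ADD COLUMN col2 VARCHAR(255);

-- 2. Backfill: copy and transform the existing rows from the old column.
UPDATE table1 SET col2 = UPPER(col1) WHERE col2 IS NULL;

-- 3. Keep the columns in sync for writes that still target the old column
--    (a similar BEFORE INSERT trigger would be needed as well).
CREATE TRIGGER table1_sync_col2
BEFORE UPDATE ON table1
FOR EACH ROW
  SET NEW.col2 = UPPER(NEW.col1);

-- Any write to col1 that lands between step 2 and step 3 falls into the
-- window the questions above are about.
```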
I can think of one solution that may work but please let me know if it fits in the current approach:
Step 1: Apply the schema changes, i.e., create the new column.
Step 2: Using dynamic configuration (needs to be implemented), let the running instances know that the ZDU process has started. In that case they will keep writing to the old column in the old format, and additionally they will also write an event into one table (say a ZDU upgrade table, a new table). Only after writing to both places will the transaction be successful.
Step 3: As part of your framework, which contains the logic to migrate data from Col1 to Col2, your code/cursor should read the events in insertion order to process them, and only once an event is processed should it be deleted from the table. If any runtime error occurs, since the event has not been deleted, it will be retried.
This approach ensures the changes are processed in the order in which they occurred. And, out of the box from the framework, there would be a way to know whether all migrations are done or not.
We should also consider adding one step in the "Finalisation Step" to check that this new event table is empty; see the sketch below.
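Roughly, with hypothetical names and MySQL syntax (UPPER() again stands in for the actual transformation):

```sql
-- New table written by the running instances while the ZDU is in progress.
CREATE TABLE zdu_upgrade_event (
  id        BIGINT AUTO_INCREMENT PRIMARY KEY,
  row_id    BIGINT       NOT NULL,  -- key of the modified row in Table1
  old_value VARCHAR(255) NOT NULL,  -- value written to Col1 in the old format
  created   TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- One iteration of the migration cursor: take the oldest event, apply the
-- transformed value to the new column, and delete the event in the same
-- transaction. If anything fails before COMMIT, the event survives and the
-- iteration is simply retried.
START TRANSACTION;

SELECT id, row_id, old_value
  INTO @ev_id, @ev_row_id, @ev_old_value
  FROM zdu_upgrade_event
 ORDER BY id
 LIMIT 1
 FOR UPDATE;

UPDATE table1 SET col2 = UPPER(@ev_old_value) WHERE id = @ev_row_id;

DELETE FROM zdu_upgrade_event WHERE id = @ev_id;

COMMIT;

-- Finalisation check: the event table must be empty before finalising.
SELECT COUNT(*) FROM zdu_upgrade_event;
```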
This is just an idea; feel free to add your input if the above scenario can be handled in a different or better way.
@vikaskr22
New binaries not yet applied, meaning read/write will continue to happen against the old column. How are you handling the case where a column is updated by the application after the data has been copied from Col1 to Col2?
--- Great point. No, this case is not being handled: any update to the old column after it has been copied, but before the trigger is created, won't be reflected in the new column. For KMS, it was observed that copying one column to another for 1M rows took 35 seconds. Realistically we won't have 1M rows, just a few hundred rows at most, so the time between the column copy and the trigger creation would be a few seconds at most.
If we are using a trigger on each data modification that copies to the new column, then we have multiple trigger executions for the same column (if multiple modifications are happening), and these executions may also overlap with the job/cursor that is copying old-format data to the new column.
--- The trigger is being created after the copy job has completed, so I am not clear how the data copy job and the trigger would overlap.
This scenario poses two risks:
Trigger executions (due to multiple updates on the same column) may be processed in a different order, and older data may be written to the new column.
--- From my understanding, if there are multiple modifications to the same column, each trigger execution is processed sequentially; this behaviour is out of the box for databases. We can confirm this in case there is any gap in my understanding.
What if some trigger processing/execution fails? We may log this and report this, but do we have any systematic way (out of the box from the framework itself) to detect this and retry?
--- My understanding is that if the trigger fails, then the original transaction also fails, i.e., the original update will be reverted if the trigger fails to copy the data to the new column; the trigger is basically now a part of the original transaction.
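As an illustration (MySQL syntax, hypothetical names, independent of the sketches above), an error raised inside the trigger body aborts the UPDATE that fired it, so the old-column write and the new-column copy succeed or fail together:

```sql
DELIMITER //
CREATE TRIGGER table1_copy_col2
BEFORE UPDATE ON table1
FOR EACH ROW
BEGIN
  SET NEW.col2 = UPPER(NEW.col1);
  IF NEW.col2 IS NULL THEN
    -- Simulated copy failure: this aborts the triggering UPDATE statement.
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'col1 -> col2 copy failed';
  END IF;
END//
DELIMITER ;
```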
Step 1: Apply the schema changes, i.e., create the new column.
Step 2: Using dynamic configuration (needs to be implemented), let the running instances know that the ZDU process has started. In that case they will keep writing to the old column in the old format, and additionally they will also write an event into one table (say a ZDU upgrade table, a new table). Only after writing to both places will the transaction be successful.
Step 3: As part of your framework, which contains the logic to migrate data from Col1 to Col2, your code/cursor should read the events in insertion order to process them, and only once an event is processed should it be deleted from the table. If any runtime error occurs, since the event has not been deleted, it will be retried.
--- I am not clear on this. The old bits / old KMS instance will not have the logic to write to a new table; the old application is not aware of the new schema changes or of a new table that needs to be written to. Writing to both the new and old columns can be done by the new KMS instance so that new data is available to the old instances too, but I don't fully understand how the old instances can write to the new column.
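For what it's worth, the dual-write from the new binaries could be as simple as setting both columns in the same statement (hypothetical names again; in practice this goes through the application's persistence layer):

```sql
-- The new KMS instance writes both the old and the new format in one
-- statement, so old instances that still read col1 see consistent data.
UPDATE table1
   SET col1 = 'value-in-old-format',
       col2 = UPPER('value-in-old-format')
 WHERE id = 42;
```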
@fateh288, I again went through the design and, as explained by you as well, DB trigger processing happens as part of the application transaction. In that case we don't need any separate mechanism to ensure data durability, so the case of triggers getting executed out of order is not valid; it should be taken care of by the underlying DB. Hence the observation regarding triggers is INVALID, please ignore that.
Now let's see how we can address the following scenario:
Any update to the old column after it has been copied, but before the trigger is created, won't be reflected in the new column.