FHIR Ensure lastUpdated increases with versionId in a distributed environment

Describe the bug In a highly concurrent distributed environment, resource updates may be processed on different nodes. Those nodes may suffer clock drift, which could cause version N+1 to have an earlier lastUpdated timestamp than version N.

It is desirable for the lastUpdated time to follow the same natural ordering as versionId for a given logical resource.

Environment Which version of IBM FHIR Server? 4.11.0

To Reproduce Steps to reproduce the behavior:

Deploy a multi-node cluster where instances reside on different physical nodes (ideally in different data centers to increase the likelihood of clocks being slightly different)
Generate a large number of parallel updates for one resource
Compare the lastUpdated timestamp of each version and see if it follows the same order as the versionId
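
The comparison in the last step could be automated along these lines. This is only a sketch: the class and method names are hypothetical and not part of the IBM FHIR Server code base, and it assumes the lastUpdated values have already been fetched and sorted by versionId (version 1 first).

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class VersionOrderCheck {

    // Input: lastUpdated values for one logical resource, ordered by
    // versionId, with version 1 at index 0.
    // Output: the versionIds whose lastUpdated is not strictly later than
    // the lastUpdated of the preceding version.
    static List<Integer> findViolations(List<Instant> lastUpdatedByVersion) {
        List<Integer> violations = new ArrayList<>();
        for (int i = 1; i < lastUpdatedByVersion.size(); i++) {
            Instant previous = lastUpdatedByVersion.get(i - 1);
            Instant current = lastUpdatedByVersion.get(i);
            if (!current.isAfter(previous)) {
                violations.add(i + 1); // versionIds are 1-based
            }
        }
        return violations;
    }
}
```

An empty result means lastUpdated follows the same order as versionId for that resource.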

Expected behavior Ideally, where N is the version number, we want the following to hold: lastUpdated(N) > lastUpdated(N-1).

Additional context This is important when using the whole-system history endpoint to ensure that resource version changes are returned in the expected order.

The logic has been updated to allow a drift of up to 2 seconds, which is reasonable for a cluster with properly configured network time synchronization. If the drift is 2 or more seconds, the request is rejected with a 500 Server Error (because something is critically wrong with the server environment).
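
A minimal sketch of this kind of check, under stated assumptions: the class and method names are hypothetical, IllegalStateException stands in for the server's FHIRPersistenceException, and the way a tolerated drift is resolved (stamping the new version 1 ms after the previous one) is an assumption, since the comment above only says that small drift is allowed.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration; not the actual IBM FHIR Server implementation.
class LastUpdatedGuard {

    // Tolerance taken from the description above.
    static final Duration MAX_DRIFT = Duration.ofSeconds(2);

    // Chooses the lastUpdated value for version N+1 so that it is always
    // strictly later than the lastUpdated of version N.
    static Instant nextLastUpdated(Instant currentLastUpdated, Instant now) {
        if (now.isAfter(currentLastUpdated)) {
            return now; // normal case: local clock is ahead of version N
        }
        Duration drift = Duration.between(now, currentLastUpdated);
        if (drift.compareTo(MAX_DRIFT) >= 0) {
            // Drift of 2 seconds or more: reject the request.
            throw new IllegalStateException(
                "Current time is before lastUpdated of current resource version.");
        }
        // Assumption: tolerate small drift by nudging just past version N.
        return currentLastUpdated.plusMillis(1);
    }
}
```

With this shape, a node whose clock is 500 ms behind still produces an increasing timestamp, while a node that is an hour behind fails the update.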

You can artificially trigger the issue if you are able to manually adjust the clock when running a local instance of the FHIR server:

Insert a patient p1
Set clock to manual and adjust it back by 1 hour
Update patient p1, making sure that at least one field in the resource is different (otherwise the update will be skipped)
The update should be rejected because the current time comes before the current lastUpdated time of the resource
Reset clock to automatic time sync.
Update patient p1. The update should succeed.
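
Where changing the system clock is not practical, the same scenario can be imitated in a unit test with java.time.Clock. The sketch below is an assumption-laden illustration: the class and method are hypothetical, and it does not imply that the FHIR server reads its time from an injectable Clock.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical test helper for illustration only.
class SkewedClockDemo {

    // True when the node's clock reads earlier than the lastUpdated value
    // already stored for the current resource version.
    static boolean isBeforeCurrentVersion(Instant currentLastUpdated, Clock nodeClock) {
        return nodeClock.instant().isBefore(currentLastUpdated);
    }

    public static void main(String[] args) {
        // Fixed reference clock standing in for "automatic time sync".
        Clock synced = Clock.fixed(Instant.parse("2022-08-12T12:33:03Z"), ZoneOffset.UTC);
        Instant insertedAt = synced.instant(); // lastUpdated of version 1

        // Same clock set back by one hour, as in the manual steps above.
        Clock setBack = Clock.offset(synced, Duration.ofHours(-1));
        // Clock after the reset, a few seconds later than the insert.
        Clock afterReset = Clock.offset(synced, Duration.ofSeconds(5));

        System.out.println(isBeforeCurrentVersion(insertedAt, setBack));    // true
        System.out.println(isBeforeCurrentVersion(insertedAt, afterReset)); // false
    }
}
```

The one-hour offset is far beyond the 2-second tolerance, so it corresponds to the update that is expected to be rejected.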

Jun 06 '22 08:06 punktilious

QA tip: might be able to change the time on the system to force this one

Aug 11 '22 13:08 lmsurpre

QA quick test shows this isn't working as expected. By changing the clock on the dev machine manually, we see:

fhirdb=> select version_id, last_updated from fhirdata.patient_resources where logical_resource_id = 85766130;
version_id |        last_updated        
------------+----------------------------
          1 | 2022-08-12 12:33:03.456305
          2 | 2022-08-12 11:34:33.734935

Aug 12 '22 12:08 punktilious

After fixing the ResourceResult to capture the lastUpdated time when given a Resource, we can now see the expected failure when the system clock is manually set back one hour:

{"resourceType":"OperationOutcome","id":"7f-0-0-1-c5d31314-d547-4db7-b7a7-2703e5c52f61","issue":[{"severity":"fatal","code":"exception","details":{"text":"FHIRPersistenceException: Current time is before lastUpdated of current resource version."}}]}

Aug 12 '22 15:08 punktilious

But this doesn't work for undeleted resources, because in that case we don't read an actual resource, only the deletion marker. So we need to capture the current lastUpdated time from the DTO, not the resource.

Aug 12 '22 16:08 punktilious

In a local environment, I inserted a new Patient resource, set the clock back by 1 hour, then tried to update the same resource and got the error below:

{
    "resourceType": "OperationOutcome",
    "id": "a-e9-93-fe-12350f71-d0d8-40ef-863a-886bc50b8b3c",
    "issue": [
        {
            "severity": "fatal",
            "code": "exception",
            "details": {
                "text": "FHIRPersistenceException: Current time is before lastUpdated of current resource version."
            }
        }
    ]
}

After the clock was reset to automatic time sync the update was successful.

This is working as expected.

Aug 22 '22 08:08 PrasannaHegde1
