FHIR
Ensure lastUpdated increases with versionId in a distributed environment
Describe the bug In a highly concurrent distributed environment, updates to a resource may be processed on different nodes. If the clocks on those nodes drift apart, version N+1 can end up with an earlier lastUpdated timestamp than version N.
It is desirable for the lastUpdated time to follow the same natural ordering as versionId for a given logical resource.
Environment Which version of IBM FHIR Server? 4.11.0
To Reproduce Steps to reproduce the behavior:
- Deploy a multi-node cluster whose instances reside on different physical nodes (ideally in different data centers, to increase the likelihood that the clocks differ slightly)
- Generate a large number of parallel updates for a single resource
- Compare the lastUpdated timestamp of each version and see if it follows the same order as the versionId
Expected behavior
Ideally, where N is the version number, we want the following to hold: lastUpdated(N) > lastUpdated(N-1).
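The expected ordering can be checked mechanically once the version history of a resource has been collected. The following is a minimal sketch, not part of the FHIR server: the function name find_ordering_violations and the (versionId, lastUpdated) tuple layout are assumptions made for illustration.

```python
from datetime import datetime
from typing import List, Tuple


def find_ordering_violations(
    versions: List[Tuple[int, datetime]],
) -> List[Tuple[int, int]]:
    """Return (previous_version, version) pairs whose lastUpdated did not strictly increase.

    `versions` is a hypothetical list of (versionId, lastUpdated) tuples for
    ONE logical resource, in any order.
    """
    # Sort by versionId so that adjacent entries are consecutive versions.
    ordered = sorted(versions, key=lambda v: v[0])
    violations = []
    for (prev_id, prev_ts), (cur_id, cur_ts) in zip(ordered, ordered[1:]):
        # The invariant is strict: lastUpdated(N) > lastUpdated(N-1).
        if cur_ts <= prev_ts:
            violations.append((prev_id, cur_id))
    return violations
```

An empty result means lastUpdated follows the same ordering as versionId for that resource; any returned pair identifies two adjacent versions whose timestamps are out of order.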
Additional context This is important when using the whole-system history endpoint to ensure that resource version changes are returned in the expected order.
The logic has been updated to tolerate clock drift of up to 2 seconds, which is reasonable for a cluster with properly configured network time synchronization. If the drift is 2 seconds or more, the request is rejected with a 500 Server Error, because something is critically wrong with the server environment.
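The drift rule can be sketched roughly as follows. This is an illustration only, not the server's actual code: the names next_last_updated, ClockDriftError and MAX_DRIFT are hypothetical, and the behavior inside the tolerance window (nudging the timestamp to just after the previous version) is an assumption, since the issue only states that such drift is allowed.

```python
from datetime import datetime, timedelta

# Tolerance stated in the issue: drift of 2 seconds or more is rejected.
MAX_DRIFT = timedelta(seconds=2)


class ClockDriftError(Exception):
    """Hypothetical error type; the server reports this case as a 500 Server Error."""


def next_last_updated(
    now: datetime,
    previous_last_updated: datetime,
    max_drift: timedelta = MAX_DRIFT,
) -> datetime:
    """Choose the lastUpdated value for a new version of a resource."""
    if now > previous_last_updated:
        # Normal case: the local clock is ahead of the previous version.
        return now

    drift = previous_last_updated - now
    if drift >= max_drift:
        # The clock is too far behind; refuse the update.
        raise ClockDriftError(
            "Current time is before lastUpdated of current resource version."
        )

    # ASSUMPTION: within the tolerance, move the timestamp just past the
    # previous version so that ordering still follows versionId.
    return previous_last_updated + timedelta(milliseconds=1)
```

Rejecting large drift rather than silently adjusting it keeps a misconfigured node from writing timestamps far from real time, at the cost of failing updates until the clock is corrected.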
You can artificially trigger the issue if you are able to manually adjust the clock when running a local instance of the FHIR server:
- Insert a patient p1
- Set clock to manual and adjust it back by 1 hour
- Update patient p1, making sure that at least one field in the resource is different (otherwise the update will be skipped)
- The update should be rejected because the current time comes before the current lastUpdated time of the resource
- Reset clock to automatic time sync.
- Update patient p1. The update should succeed.
QA tip: it may be possible to force this condition by changing the system time.
QA quick test shows this isn't working as expected. By changing the clock on the dev machine manually, we see:
fhirdb=> select version_id, last_updated from fhirdata.patient_resources where logical_resource_id = 85766130;
 version_id |        last_updated
------------+----------------------------
          1 | 2022-08-12 12:33:03.456305
          2 | 2022-08-12 11:34:33.734935
After fixing the ResourceResult to capture the lastUpdated time when given a Resource, we can now see the expected failure when the system clock is manually set back one hour:
{"resourceType":"OperationOutcome","id":"7f-0-0-1-c5d31314-d547-4db7-b7a7-2703e5c52f61","issue":[{"severity":"fatal","code":"exception","details":{"text":"FHIRPersistenceException: Current time is before lastUpdated of current resource version."}}]}
But this doesn't work for undeleted resources: in that case we don't read an actual resource, only the deletion marker, so we need to capture the current lastUpdated time from the DTO rather than from the resource.
In a local environment, I inserted a new Patient resource, set the clock back by 1 hour, then tried to update the same resource and got the error below:
{
  "resourceType": "OperationOutcome",
  "id": "a-e9-93-fe-12350f71-d0d8-40ef-863a-886bc50b8b3c",
  "issue": [
    {
      "severity": "fatal",
      "code": "exception",
      "details": {
        "text": "FHIRPersistenceException: Current time is before lastUpdated of current resource version."
      }
    }
  ]
}
After the clock was reset to automatic time sync the update was successful.
This is working as expected.