hapi-fhir
Possible data consistency bug performing quick updates to same resource via dao
Describe the bug Quick updates to the same resource via direct DAO calls on the JPA server can lead to an inconsistent database state.
We have some business logic performing resource transformations directly via the DAO objects. In one case we update the same resource very quickly via the same thread and JPA transaction. This may result in the second update being only partially applied to the resource database table (see the example tables below).
In this example the server will then answer GET calls with version 24 and will reject PUT calls with a conflict error, because it tries to insert version 25 again, causing ERROR: duplicate key value violates unique constraint "idx_resver_id_ver". Note that the update timestamp is correct.
In this particular case we were able to get rid of one of the update calls, so it is no longer a problem for us. Nevertheless, this seems to be unexpected behavior or even a bug.
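For illustration, the pattern that triggered it was roughly the following (a simplified sketch, not our actual code; the transformation logic is omitted, and myTxManager, myPatientDao, and the resource ID are placeholders):

    // Two quick updates to the same resource from the same thread, inside a
    // single JPA transaction (myTxManager stands in for the configured
    // PlatformTransactionManager).
    TransactionTemplate txTemplate = new TransactionTemplate(myTxManager);
    txTemplate.execute(status -> {
        Patient p = new Patient();
        p.setId("Patient/1675");
        p.setActive(true);
        myPatientDao.update(p, new SystemRequestDetails()); // writes version 24

        p.setActive(false);
        myPatientDao.update(p, new SystemRequestDetails()); // writes version 25 to
        // HFJ_RES_VER, but HFJ_RESOURCE.res_ver can be left at 24 (tables below)
        return null;
    });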
HFJ_RES_VER
res_id | res_updated | res_ver
--------+-------------------------+---------
1675 | 2023-05-04 19:21:47.047 | 23
1675 | 2023-05-04 19:21:51.783 | 24
1675 | 2023-05-04 19:21:51.795 | 25
HFJ_RESOURCE
res_id | res_updated | res_ver
--------+-------------------------+---------
1675 | 2023-05-04 19:21:51.795 | 24
Expected behavior HFJ_RESOURCE should have correct res_ver.
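A check along these lines flags the broken state (a JDBC sketch; connection setup is omitted, and table and column names are as shown above):

    // The current version in HFJ_RESOURCE should equal the highest version in
    // HFJ_RES_VER for the same res_id; in the tables above it is 24 vs 25.
    try (PreparedStatement ps = conn.prepareStatement(
            "SELECT r.res_ver, MAX(v.res_ver) FROM hfj_resource r "
                    + "JOIN hfj_res_ver v ON v.res_id = r.res_id "
                    + "WHERE r.res_id = ? GROUP BY r.res_ver")) {
        ps.setLong(1, 1675L);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next() && rs.getLong(1) != rs.getLong(2)) {
                System.out.println("Inconsistent: HFJ_RESOURCE.res_ver=" + rs.getLong(1)
                        + " but max HFJ_RES_VER.res_ver=" + rs.getLong(2));
            }
        }
    }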
Environment (please complete the following information):
- HAPI FHIR Version: 6.4.4
- OS: Debian
- DB: PostgreSQL
Additional context The server was set up using a previous version of HAPI FHIR. The official schema migration tool was used during the upgrade.
I've committed a test which attempts to replicate this issue but does not, in the branch ja_20230908_4843_resource_update_contention_issue, in the test testHighConcurrencyUpdatesToSameResource().
If anyone can provide reliable replication steps against 6.8.x or master, or even better can implement them in this test, that would be great.
I vaguely remember seeing this same behaviour once a long time ago but haven't seen it recently.
One other thought: it's possible that this is a Postgres-specific thing (I'm assuming everyone here is using PG?). I only tried to replicate in H2.
With the (likely) related issue https://github.com/hapifhir/hapi-fhir/issues/4690 we saw the issue on an Azure SQL instance (MS SQL Server). That's not to say there's not something different in H2, but it's not just a Postgres behavior. I have not yet attempted to reproduce it in a unit test, though.
We also see this as a production issue while doing sequential (not parallel) HTTP PATCH operations on the same resource. We're running HAPI 6.10.1 on Postgres, with the schema migrated to 7.0.0.20231213.1. The failure we see:
org.hibernate.engine.jdbc.batch.internal.BatchingBatch
HHH000315: Exception executing batch [java.sql.BatchUpdateException: Batch entry 0 insert into hfj_res_ver (partition_date, partition_id, res_deleted_at, res_version, has_tags, res_published, res_updated, res_encoding, request_id, res_text, res_id, res_text_vc, res_type, res_ver, source_uri, pid) values (XXX) was aborted: ERROR: duplicate key value violates unique constraint "idx_resver_id_ver
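For concreteness, the kind of sequence described looks like this (a sketch, not our pipeline code; the server URL, resource ID, and patch body are placeholders):

    // Sequential JSON Patch requests against the same resource via the
    // HAPI FHIR generic client.
    FhirContext ctx = FhirContext.forR4();
    IGenericClient client = ctx.newRestfulGenericClient("http://localhost:8080/fhir");
    String patchBody = "[{\"op\":\"replace\",\"path\":\"/active\",\"value\":true}]";
    client.patch().withBody(patchBody).withId("Patient/1675").execute();
    client.patch().withBody(patchBody).withId("Patient/1675").execute();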
@jamesagnew I extracted your test testHighConcurrencyUpdatesToSameResource() and applied it to my local (7.1.6-SNAPSHOT) master branch. It does hit the "Got conflict" case due to: "org.h2.jdbc.JdbcBatchUpdateException: Unique index or primary key violation: "PUBLIC.IDX_RESVER_ID_VER_INDEX_8 ON PUBLIC.HFJ_RES_VER..."
Is this the desired behavior when parallel updates to a resource occur, or should the server detect/serialize update operations to a single resource?
Here's the test for quick reference:
@Test
public void testHighConcurrencyUpdatesToSameResource() throws InterruptedException {
    createPatient(withId("A"), withActiveFalse());

    // Fire 1000 concurrent updates against the same Patient from a 10-thread pool
    ThreadPoolTaskExecutor threadPool = ThreadPoolUtil.newThreadPool(10, 10, "hcu-", 0);
    List<Future<DaoMethodOutcome>> futures = new ArrayList<>(1000);
    for (int i = 0; i < 1000; i++) {
        Patient p = new Patient();
        p.setId("Patient/A");
        p.addIdentifier().setValue(Integer.toString(i));
        futures.add(threadPool.submit(() -> myPatientDao.update(p, new SystemRequestDetails())));
    }

    // Each update either succeeds or fails with a version conflict
    for (var next : futures) {
        try {
            DaoMethodOutcome daoMethodOutcome = next.get();
            ourLog.info("Got ID: {}", daoMethodOutcome.getId());
        } catch (ExecutionException e) {
            assertEquals(ResourceVersionConflictException.class, e.getCause().getClass());
            ourLog.info("Got conflict: {}", e.getCause().getMessage());
        }
    }

    // A final single-threaded update should still succeed if the DB is consistent
    Patient p = new Patient();
    p.setId("Patient/A");
    p.addIdentifier().setValue("0");
    myPatientDao.update(p, new SystemRequestDetails());
}
@jamesagnew another way to reproduce it is as described here: https://chat.fhir.org/#narrow/stream/179167-hapi/topic/Trouble.20with.20placeholder.20IDs.20from.20Synthea/near/455033402. Even with small datasets it's evident that something is going on.
I've added an additional test which runs the stress test on Postgres (and other DBs) and I'm still completely unable to reproduce this.
If anyone is still experiencing this issue, a unit test that demonstrates it would be great (using the branch https://github.com/hapifhir/hapi-fhir/compare/ja_20230908_4843_resource_update_contention_issue as a starting point could be helpful).
I'm also curious about the database version at play. The DatabaseVerificationWithPostgresIT test verifies against Postgres 16; possibly this is a bug in older PG?
@jamesagnew - I don't have the full HAPI project set up locally at the moment to check your branch, but here is a test against the latest JPA master which for me throws a unique index violation against H2: "Unique index or primary key violation: "PUBLIC.IDX_RESVER_ID_VER_INDEX_8 ON PUBLIC.HFJ_RES_VER..."
https://github.com/XcrigX/hapi-fhir-jpaserver-starter/blob/parallel-updates-test/src/test/java/ca/uhn/fhir/jpa/starter/ParellelUpdateTest.java
Edit to add: the test just launches two threads (configurable, but two is enough on my machine) that both update the same patient in parallel.
@XcrigX that doesn't look like the same issue to me - if you submit 2 updates in parallel, having one of them fail with a constraint error is expected. So unless I'm missing something, that test looks to me like it's failing on correct behaviour.
The issue above seems to be that this expected failure (or some race condition anyhow) leaves the DB in an inconsistent state.
@jamesagnew Perhaps I'm misunderstanding the problem then. The OP was getting the same (I think) conflict on the HFJ_RES_VER table, though they said it was running on a single thread, so it's not completely clear if it's the same or not.
I was able to reproduce the same error using your testHighConcurrencyUpdatesToSameResource test back in March - which also appears to be attempting to update the same resource in parallel.
Anyway, the real-world case where I hit this actual problem was in an ingestion pipeline that creates and sends bundles to the server in parallel. In my case, the bundles would frequently have common Organizations and Practitioners in them, and those were the resources that were failing when I tried to blindly UPSERT/PUT them. See #4690.
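The upsert pattern described looks roughly like this (a sketch; the resource ID and name are placeholders): several pipeline workers submit transaction bundles like the one below in parallel, each PUTting the same Organization by known ID.

    // A transaction bundle that blindly upserts a shared Organization by ID;
    // parallel workers submitting bundles like this collide on the same rows.
    Bundle bundle = new Bundle();
    bundle.setType(Bundle.BundleType.TRANSACTION);
    Organization org = new Organization();
    org.setId("Organization/shared-org"); // same ID across workers
    org.setName("Shared Org");
    bundle.addEntry()
            .setResource(org)
            .setFullUrl("Organization/shared-org")
            .getRequest()
            .setMethod(Bundle.HTTPVerb.PUT)
            .setUrl("Organization/shared-org");
    client.transaction().withBundle(bundle).execute();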
@XcrigX for that issue, you can use UserRequestRetryVersionConflictsInterceptor along with a request header to specify that the server should automatically retry if it sees a constraint error caused by parallel updates.
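For reference, a sketch of that setup (the registration point depends on your server configuration, and the header name and format here follow the HAPI docs; verify against your version):

    // Server side: register the interceptor so clients can opt in to retries.
    myInterceptorRegistry.registerInterceptor(new UserRequestRetryVersionConflictsInterceptor());

    // Client side: ask the server to retry this request on version conflicts.
    client.update()
            .resource(patient)
            .withAdditionalHeader("X-Retry-On-Version-Conflict", "retry; max-retries=3")
            .execute();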
@XcrigX If you do not have the described table state, where the resource is bricked and cannot be updated at all anymore, it is not the same issue. Regarding reproducibility: as I said in the issue, we got rid of the code causing it and never looked back, so I currently don't know if it can still happen with current HAPI and Postgres. If I find the time, I'll look up the old code in the repo and try to create a test case.