org.hl7.fhir.core
Validation of bundle entry references is O(n²), which can make validating a single file take hours
Validation of references to bundle entries appears to be quadratic in the number of references/entries. Here are the timings for validating minimal collection bundles whose Composition entries have varying numbers of author references to Patient resources included in the bundle. The validator command line contained the series of test files twice and the timings shown are from the second series, in order to minimise the effect of JVM/startup vagaries.
FHIR Validation tool Version 5.6.46 (Git# eca2fa5a5ce6). Built 2022-05-12T10:35:44.416Z (6 days old)
Java: 16.0.1 from C:\Program Files\Java\jdk-16.0.1 on amd64 (64bit). 8172MB available
...
Validate Bundle_1_via_urn.xml ..Detect format for Bundle_1_via_urn.xml
00:00.0007
Validate Bundle_10_via_urn.xml ..Detect format for Bundle_10_via_urn.xml
00:00.0008
Validate Bundle_100_via_urn.xml ..Detect format for Bundle_100_via_urn.xml
00:00.0033
Validate Bundle_1000_via_urn.xml ..Detect format for Bundle_1000_via_urn.xml
00:00.0842
Validate Bundle_10000_via_urn.xml ..Detect format for Bundle_10000_via_urn.xml
01:17.0941
Validate Bundle_100000_via_urn.xml ..Detect format for Bundle_100000_via_urn.xml
03:17:12.0441
Note the first jump from 100 to 1000, which seems to indicate that validation time starts to get dominated by the complexity of the reference checking logic around this point, and the huge jump from 1000 to 10000. The timing for 100000 is from a manual run.
Replacing the URN references with ones that use logical ids (e.g. {resource-type}/{id}) doubles the timings for bigger counts:
Validate Bundle_10000_via_id.xml ..Detect format for Bundle_10000_via_id.xml
02:39.0600
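To make the quadratic behaviour concrete, here is an illustrative linear matcher (my own sketch, not the validator's actual code) following the FHIR rules for resolving references inside a bundle: urn references match an entry's fullUrl directly, while {resource-type}/{id} references have to be checked against the entry's resource type and id, and against the tail of an absolute fullUrl.

```java
// Illustration of why naive in-bundle reference resolution is quadratic:
// each reference is checked against every entry until a match is found.
import java.util.List;

public class NaiveResolver {
    record Entry(String fullUrl, String resourceType, String id) {}

    static Entry resolve(String reference, List<Entry> entries) {
        for (Entry e : entries) {                                   // O(n) scan per reference
            if (reference.equals(e.fullUrl())) return e;            // urn:uuid:... match
            if (reference.equals(e.resourceType() + "/" + e.id())) return e;  // Type/id match
            if (e.fullUrl() != null && e.fullUrl().endsWith("/" + reference)) return e;
        }
        return null;                                                // unresolved
    }
    // n references x O(n) scan each = O(n^2) for the whole bundle
}
```

Note that the Type/id form requires more comparisons per entry than the urn form, which is consistent with the roughly doubled timings observed above.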
The resources were modelled after the following template (without the comments and formatting):
<Bundle xmlns="http://hl7.org/fhir">
  <!-- all included fields are strictly mandatory -->
  <type value="collection"/>
  <entry>
    <fullUrl value="urn:uuid:016e8556-ce1c-40cb-aa3d-8b7a3e32e3b1"/>
    <resource>
      <Composition>
        <status value="final"/>
        <!-- using a regular document type code here in order to avoid an informational message -->
        <type>
          <coding>
            <system value="http://loinc.org"/>
            <!-- laboratory report -->
            <code value="11502-2"/>
          </coding>
        </type>
        <date value="2022-05-18"/>
        <author>
          <reference value="urn:uuid:13ed3afe-3d31-4148-8e95-dd62d140da57"/>
        </author>
        <!-- additional author references go here -->
        <title value="(mandatory title)"/>
      </Composition>
    </resource>
  </entry>
  <entry>
    <fullUrl value="urn:uuid:13ed3afe-3d31-4148-8e95-dd62d140da57"/>
    <resource>
      <!-- Patient must have at least one child element because it cannot have a @value -->
      <Patient>
        <active value="true"/>
      </Patient>
    </resource>
  </entry>
  <!-- additional Patient entries go here -->
</Bundle>
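For reference, the test files can be produced mechanically from this template. The following generator is a hypothetical sketch (the class and method names are my own, not from any existing tool); it emits a collection bundle with n author references and n matching Patient entries:

```java
// Hypothetical generator for the test bundles described above: one Composition
// with n author references, each resolving to one of n Patient entries.
import java.util.UUID;

public class BundleGenerator {
    public static String generate(int n) {
        StringBuilder refs = new StringBuilder();
        StringBuilder entries = new StringBuilder();
        for (int i = 0; i < n; i++) {
            String uuid = UUID.randomUUID().toString();
            refs.append("      <author><reference value=\"urn:uuid:").append(uuid).append("\"/></author>\n");
            entries.append("  <entry>\n")
                   .append("    <fullUrl value=\"urn:uuid:").append(uuid).append("\"/>\n")
                   .append("    <resource><Patient><active value=\"true\"/></Patient></resource>\n")
                   .append("  </entry>\n");
        }
        return "<Bundle xmlns=\"http://hl7.org/fhir\">\n"
             + "  <type value=\"collection\"/>\n"
             + "  <entry>\n"
             + "    <fullUrl value=\"urn:uuid:" + UUID.randomUUID() + "\"/>\n"
             + "    <resource><Composition>\n"
             + "      <status value=\"final\"/>\n"
             + "      <type><coding><system value=\"http://loinc.org\"/><code value=\"11502-2\"/></coding></type>\n"
             + "      <date value=\"2022-05-18\"/>\n"
             + refs
             + "      <title value=\"(mandatory title)\"/>\n"
             + "    </Composition></resource>\n"
             + "  </entry>\n"
             + entries
             + "</Bundle>\n";
    }

    public static void main(String[] args) {
        System.out.print(generate(Integer.parseInt(args[0])));
    }
}
```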
For the files that used logical ids for references, the active element was replaced with an appropriate id element, naturally.
This validator behaviour can cause problems in production. For example, the files involved in invoicing for electronic prescriptions in Germany can contain up to 25,000 prescription entries (and hence references to those entries), resulting in validation times of several hours per file. (See eRezeptAbrechnungdaten at Simplifier, and in particular GKVSV_PR_TA7_Rechnung_Bundle.)
I wanted you to be aware of this practical aspect, so that you can keep it in mind when planning structural changes.
The optimal solution would probably be a self-activating strategy, whereby the validator simply counts the number of failed comparisons during a search and switches to a HashMap-based approach if that count exceeds a configured threshold like 42 (restarting the current search after the index is built). That way the cost of indexing would only be incurred when it matters, resulting in speedups of several orders of magnitude.
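The proposed strategy could look roughly like the following sketch (class and field names are illustrative, and only fullUrl matching is shown; the real validator would also have to cover the Type/id case):

```java
// Sketch of the proposed self-activating strategy: count failed comparisons
// during linear searches and switch to a HashMap index once a configured
// threshold is exceeded, restarting the search that triggered the switch.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdaptiveResolver {
    private static final int THRESHOLD = 42;    // configured threshold from the proposal

    private final List<String> fullUrls;        // one fullUrl per bundle entry
    private Map<String, Integer> index;         // built lazily: fullUrl -> entry position
    private long failedComparisons;

    public AdaptiveResolver(List<String> fullUrls) {
        this.fullUrls = fullUrls;
    }

    /** Returns the entry position for the reference, or -1 if unresolved. */
    public int resolve(String reference) {
        if (index != null) return index.getOrDefault(reference, -1);
        for (int i = 0; i < fullUrls.size(); i++) {
            if (reference.equals(fullUrls.get(i))) return i;
            if (++failedComparisons > THRESHOLD) {
                buildIndex();                   // switch strategies mid-search...
                return resolve(reference);      // ...and restart the current search
            }
        }
        return -1;
    }

    private void buildIndex() {
        index = new HashMap<>();
        for (int i = 0; i < fullUrls.size(); i++) {
            index.putIfAbsent(fullUrls.get(i), i);
        }
    }
}
```

Small bundles never pay for building the index, while large ones pay it exactly once, dropping the per-reference cost from O(n) to expected O(1).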
Release 5.6.47 improves the performance of reference resolution dramatically. Here are the results of re-running the earlier test with the new release:
FHIR Validation tool Version 5.6.47 (Git# 567a9b2ce7bd). Built 2022-05-27T17:24:34.812Z (14 hours old)
Java: 11.0.15 from /usr/lib/jvm/java-11-openjdk-amd64 on amd64 (64bit). 16012MB available
...
Validate Bundle_1_via_urn.xml
Validate Bundle against http://hl7.org/fhir/StructureDefinition/Bundle..........20..........40..........60..........80.........|
00:00.003
Validate Bundle_10_via_urn.xml
Validate Bundle against http://hl7.org/fhir/StructureDefinition/Bundle..........20..........40..........60..........80.........|
00:00.003
Validate Bundle_100_via_urn.xml
Validate Bundle against http://hl7.org/fhir/StructureDefinition/Bundle..........20..........40..........60..........80.........|
00:00.010
Validate Bundle_1000_via_urn.xml
Validate Bundle against http://hl7.org/fhir/StructureDefinition/Bundle..........20..........40..........60..........80.........|
00:00.115
Validate Bundle_10000_via_urn.xml
Validate Bundle against http://hl7.org/fhir/StructureDefinition/Bundle..........20..........40..........60..........80.........|
00:04.257
Validate Bundle_100000_via_urn.xml
Validate Bundle against http://hl7.org/fhir/StructureDefinition/Bundle..........20..........40..........60..........80.........|
23:11.267
(erroneous zeroes between decimal point and second fraction removed)
The test was run on a system that is roughly comparable to the one from the original test. The bundle with 10000 references was validated almost twenty times as fast as with 5.6.46; the speedup for 100000 is not as big but still almost an order of magnitude. The growth is now clearly below quadratic, which is an important achievement.
In order to get a more precise picture of the progression I ran another test with reference counts up to 25000 in steps of 1000 and graphed the results.
The anomaly near 17000 is most probably due to a garbage collection. Timings went up about 40% during subsequent runs, most likely due to thermal throttling of the CPU. Comparing the values showed that the CPU must already have been throttled for the last handful of steps in the graphed first run.
@grahamegrieve The full-size test file with 24984 electronic prescriptions (a few KiB shy of 1 GiB) that originally took several hours to validate was done in 28:51 min. Output excerpt and the command line for invoking the validator are reported in issue #35 at the ABDA reference validator repo. The code system fixup package that needs to be installed manually into the local FHIR package cache is attached to this post; all other packages get downloaded automatically as dependencies of the main profile package given on the command line (de.gkvsv.erezeptabrechnungsdaten#1.1.0).
Attachment: dav.kbv.sfhir.cs.vs-1.0.3-json.tgz.zip (P.S.: I had to wrap the .tgz in a .zip because GitHub doesn't allow attaching .tgz files.)
So how do things stand now?