FHIR Bundle Engine Performance improvements

Open csenn opened this issue 1 year ago • 5 comments

I've been wondering about the maximum performance achievable with cql_engine when evaluating over populations. The case I'm looking at is evaluating a single CQL library (along with its included libraries and ValueSets) against a population of patient FHIR Bundles.

I had trouble with the evaluator's Dagger API because I wanted to compile the CQL to engine.Library only once, rather than on every iteration of the per-patient loop, where each patient required a new DataProviderFactory. I was able to get this working by using cql_engine more directly, with the same underlying classes.

In the test I'm using 1,000 Synthea FHIR bundles. First, they are all loaded from disk into memory and parsed as HAPI FHIR bundles. Then all the CQL is translated into ELM and loaded into engine.Library classes.

The test uses a small-to-mid-sized library with 4 included libraries in total (including FHIRHelpers). It also uses two ValueSets, with 116 codes and 15 codes respectively.

The current performance is quite good, at 2,745 FHIR Bundles per second over 15 runs. But just to see if there was any room for improvement, I created a FlameGraph of the execution, and some low-hanging fruit immediately popped out:

FlameGraph Perf

81% of the execution time is spent in the anyCodeInValueSet function.

This algorithm is effectively an O(mn) calculation, where:

  • m is the number of resources of a certain type (such as Encounter resources)
  • n is the number of codes in the ValueSet

Instead, if the codings were stored in a hashable lookup of some sort, you should be able to use set operations and get this down to O(min(m, n)) on average. See the Python set operations time complexity for reference.

Instead of:

for resource in resources:
    for valueSetCode in valueSet.codes:
        if resource.code == valueSetCode:
            return true

You could do:

for resource in resources:
    if resource.code in valueSet.hashedCodeLookup:
        return true
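A minimal runnable sketch of the hashed-lookup version, in Python for illustration only. It assumes a code is identified by its (system, code) pair; the `Coding` tuple, `build_lookup`, and this `any_code_in_value_set` are made-up names, not the engine's classes, and the codes are arbitrary example values. The lookup is built once per ValueSet, so each resource then costs one average-case O(1) membership test.

```python
from typing import Iterable, NamedTuple


class Coding(NamedTuple):
    """Hypothetical stand-in for a FHIR Coding: only system and code are compared."""
    system: str
    code: str


def build_lookup(value_set_codes: Iterable[Coding]) -> frozenset:
    # Built once per ValueSet and reused for every patient bundle.
    return frozenset(value_set_codes)


def any_code_in_value_set(resource_codes: Iterable[Coding], lookup: frozenset) -> bool:
    # One hash probe per resource code instead of a scan over the ValueSet.
    return any(code in lookup for code in resource_codes)


SNOMED = "http://snomed.info/sct"
lookup = build_lookup([Coding(SNOMED, "185349003"), Coding(SNOMED, "308335008")])
encounters = [Coding(SNOMED, "999"), Coding(SNOMED, "308335008")]
print(any_code_in_value_set(encounters, lookup))  # True
```

Because `any` short-circuits, this keeps the early return of the original loop while dropping the inner scan over the ValueSet.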

There are a few things to work out, such as a resource having multiple codes, or a ValueSet having fewer codes than there are resources, in which case you'd want to flip the loop.
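Both of those cases can be sketched with plain set operations. The sketch below is Python for illustration and assumes each resource exposes a list of (system, code) tuples, a simplification of a CodeableConcept's codings; `index_codes` and `any_code_in_value_set` are hypothetical names. As far as I know, CPython's `set.isdisjoint` iterates over the smaller set when both arguments are sets, which is effectively the flipped loop. Note that the flip only pays off if the resource-code index is built once (for example when the bundle is loaded), since building it is itself O(m).

```python
from typing import Iterable, List, Set, Tuple

Code = Tuple[str, str]  # (system, code); simplified stand-in for a FHIR Coding


def index_codes(resources: Iterable[List[Code]]) -> Set[Code]:
    # Flatten every coding of every resource into one set, e.g. at bundle load time.
    return {code for codings in resources for code in codings}


def any_code_in_value_set(resource_codes: Set[Code], value_set: Set[Code]) -> bool:
    # True when the two sets share at least one code.
    return not resource_codes.isdisjoint(value_set)


LOINC = "http://loinc.org"
value_set = {(LOINC, "4548-4"), (LOINC, "17856-6")}
observations = [
    [(LOINC, "8302-2")],                     # one coding
    [(LOINC, "2339-0"), (LOINC, "4548-4")],  # two codings on one resource
]
print(any_code_in_value_set(index_codes(observations), value_set))  # True
```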

I'll spend a little time prototyping when I get a chance, but I'm interested in your thoughts on this and whether it makes sense.

csenn, Aug 25 '22 15:08