pathling Array representations within FHIRPath queries

This issue represents the work to refactor the FHIRPath query engine to use arrays and array operations to represent FHIRPath expressions. This will have a number of benefits, including the elimination of joins and the simplification of query plans.

This work was initiated within #1438, but we would like to integrate this work back into main without the SQL on FHIR views implementation itself. This will allow us to evaluate this change separately, and control the magnitude of the change.

Outstanding tasks:

[x] Create a new branch (off fhir-views) and a PR for this change
[x] Do a partial implementation on aggregate to determine overall feasibility
[ ] Complete implementation of FHIRPath functions, operators and other syntax elements
[ ] Reactivate aggregate and extract executors and refactor to work with new implementation
[ ] Reactivate test suite and refactor as necessary
[ ] Update benchmarking and evaluate relative to main

Oct 03 '23 00:10 johngrimes

I have found an issue with the implementation on issue/1759, given the following query:

observations = data.view(
    "Observation",
    select=[
        {
            "column": [
                {
                    "description": "Observation ID",
                    "path": "getResourceKey()",
                    "name": "id",
                },
                {
                    "description": "Patient ID",
                    "path": "subject.getReferenceKey()",
                    "name": "patient_id",
                },
                {
                    "description": "Observation date",
                    "path": "effectiveDateTime",
                    "name": "date",
                },
            ],
            "select": [
                {
                    "forEach": "code.coding",
                    "column": [
                        {
                            "description": "Observation code",
                            "path": "code",
                            "name": "code",
                        },
                    ],
                },
                {
                    "forEach": "value.ofType(Quantity)",
                    "column": [
                        {
                            "description": "Total cholesterol unit",
                            "path": "unit",
                            "name": "unit",
                        },
                        {
                            "description": "Total cholesterol value",
                            "path": "value",
                            "name": "value",
                        },
                    ],
                },
            ],
        }
    ],
    where=[
        {
            "description": "Total cholesterol > 240 mg/dL",
            "path": "where(code.coding.where(system = 'http://loinc.org').exists(code = '2093-3')).value.ofType(Quantity) > 240 'mg/dL'",
        }
    ],
)

observations.explain()

I get the following plan:

{'resource': 'Observation', 'select': [{'column': [{'description': 'Observation ID', 'path': 'getResourceKey()', 'name': 'id'}, {'description': 'Patient ID', 'path': 'subject.getReferenceKey()', 'name': 'patient_id'}, {'description': 'Observation date', 'path': 'effectiveDateTime', 'name': 'date'}], 'select': [{'forEach': 'code.coding', 'column': [{'description': 'Observation code', 'path': 'code', 'name': 'code'}]}, {'forEach': 'value.ofType(Quantity)', 'column': [{'description': 'Total cholesterol unit', 'path': 'unit', 'name': 'unit'}, {'description': 'Total cholesterol value', 'path': 'value', 'name': 'value'}]}]}], 'where': [{'description': 'Total cholesterol > 240 mg/dL', 'path': "where(code.coding.where(system = 'http://loinc.org').exists(code = '2093-3')).value.ofType(Quantity) > 240 'mg/dL'"}]}
== Physical Plan ==
Project [Observation#974.id_versioned AS id#832, Observation#974.subject.reference AS patient_id#834, Observation#974.effectiveDateTime AS date#836, @8ehd09#982.code AS code#838, Observation#974.valueQuantity.unit AS unit#840, Observation#974.valueQuantity.value AS value#842]
+- Filter isnotnull(@8ehd09#982)
   +- Generate explode(Observation#974.code.coding), [Observation#974], false, [@8ehd09#982]
      +- Project [struct(id, id#883, id_versioned, id_versioned#884, meta, meta#885, implicitRules, implicitRules#886, language, language#887, text, text#888, identifier, identifier#889, basedOn, basedOn#890, partOf, partOf#891, status, status#892, category, category#893, code, code#894, ... 62 more fields) AS Observation#974]
         +- Filter (isnotnull(valueQuantity#909) AND CASE WHEN CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894.coding, lambdafunction((lambda x_0#1015.system = http://loinc.org), lambda x_0#1015, false)), lambdafunction((lambda x_2#1023.code = 2093-3), lambda x_2#1023, false))) THEN (size(filter(filter(code#894.coding, lambdafunction((lambda x_0#1015.system = http://loinc.org), lambda x_0#1015, false)), lambdafunction((lambda x_2#1023.code = 2093-3), lambda x_2#1023, false)), true) = 0) ELSE true END THEN (valueQuantity#909._code_canonicalized = g.m-3) ELSE false END THEN UDF(CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894.coding, lambdafunction((lambda x_1#1016.system = http://loinc.org), lambda x_1#1016, false)), lambdafunction((lambda x_3#1024.code = 2093-3), lambda x_3#1024, false))) THEN (size(filter(filter(code#894.coding, lambdafunction((lambda x_1#1016.system = http://loinc.org), lambda x_1#1016, false)), lambdafunction((lambda x_3#1024.code = 2093-3), lambda x_3#1024, false)), true) = 0) ELSE true END THEN valueQuantity#909._value_canonicalized END, [2400,0]) ELSE false END)
            +- FileScan parquet [id#883,id_versioned#884,meta#885,implicitRules#886,language#887,text#888,identifier#889,basedOn#890,partOf#891,status#892,category#893,code#894,subject#895,focus#896,encounter#897,effectiveDateTime#898,effectiveInstant#899,effectivePeriod#900,effectiveTiming#901,issued#902,performer#903,valueBoolean#904,valueCodeableConcept#905,valueDateTime#906,... 19 more fields] Batched: false, DataFilters: [isnotnull(valueQuantity#909), CASE WHEN CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894..., Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[file:/Users/gri306/Library/CloudStorage/OneDrive-CSIRO/Data/synth..., PartitionFilters: [], PushedFilters: [IsNotNull(valueQuantity)], ReadSchema: struct<id:string,id_versioned:string,meta:struct<id:string,versionId:string,versionId_versioned:s...

It would be better if we did not have to retrieve all of the columns for this query, only the ones that are required.

Jan 11 '24 01:01 johngrimes

I have merged main into issue/1759, and updated the version to 6.5.0-SNAPSHOT.

Jan 11 '24 01:01 johngrimes

@piotrszul Would you mind creating a pull request for this change, so that it is easier to follow activity on the branch?

Jan 11 '24 01:01 johngrimes

pathling pathling copied to clipboard

Array representations within FHIRPath queries

pathling
pathling copied to clipboard