pathling icon indicating copy to clipboard operation
pathling copied to clipboard

Array representations within FHIRPath queries

Open johngrimes opened this issue 2 years ago • 3 comments

This issue represents the work to refactor the FHIRPath query engine to use arrays and array operations to represent FHIRPath expressions. This will have a number of benefits, including the elimination of joins and the simplification of query plans.

This work was initiated within #1438, but we would like to integrate this work back into main without the SQL on FHIR views implementation itself. This will allow us to evaluate this change separately, and control the magnitude of the change.

Outstanding tasks:

  • [x] Create a new branch (off fhir-views) and a PR for this change
  • [x] Do a partial implementation on aggregate to determine overall feasibility
  • [ ] Complete implementation of FHIRPath functions, operators and other syntax elements
  • [ ] Reactivate aggregate and extract executors and refactor to work with new implementation
  • [ ] Reactivate test suite and refactor as necessary
  • [ ] Update benchmarking and evaluate relative to main

johngrimes avatar Oct 03 '23 00:10 johngrimes

I have found an issue with the implementation on issue/1759, given the following query:

observations = data.view(
    "Observation",
    select=[
        {
            "column": [
                {
                    "description": "Observation ID",
                    "path": "getResourceKey()",
                    "name": "id",
                },
                {
                    "description": "Patient ID",
                    "path": "subject.getReferenceKey()",
                    "name": "patient_id",
                },
                {
                    "description": "Observation date",
                    "path": "effectiveDateTime",
                    "name": "date",
                },
            ],
            "select": [
                {
                    "forEach": "code.coding",
                    "column": [
                        {
                            "description": "Observation code",
                            "path": "code",
                            "name": "code",
                        },
                    ],
                },
                {
                    "forEach": "value.ofType(Quantity)",
                    "column": [
                        {
                            "description": "Total cholesterol unit",
                            "path": "unit",
                            "name": "unit",
                        },
                        {
                            "description": "Total cholesterol value",
                            "path": "value",
                            "name": "value",
                        },
                    ],
                },
            ],
        }
    ],
    where=[
        {
            "description": "Total cholesterol > 240 mg/dL",
            "path": "where(code.coding.where(system = 'http://loinc.org').exists(code = '2093-3')).value.ofType(Quantity) > 240 'mg/dL'",
        }
    ],
)

observations.explain()

I get the following plan:

{'resource': 'Observation', 'select': [{'column': [{'description': 'Observation ID', 'path': 'getResourceKey()', 'name': 'id'}, {'description': 'Patient ID', 'path': 'subject.getReferenceKey()', 'name': 'patient_id'}, {'description': 'Observation date', 'path': 'effectiveDateTime', 'name': 'date'}], 'select': [{'forEach': 'code.coding', 'column': [{'description': 'Observation code', 'path': 'code', 'name': 'code'}]}, {'forEach': 'value.ofType(Quantity)', 'column': [{'description': 'Total cholesterol unit', 'path': 'unit', 'name': 'unit'}, {'description': 'Total cholesterol value', 'path': 'value', 'name': 'value'}]}]}], 'where': [{'description': 'Total cholesterol > 240 mg/dL', 'path': "where(code.coding.where(system = 'http://loinc.org').exists(code = '2093-3')).value.ofType(Quantity) > 240 'mg/dL'"}]}
== Physical Plan ==
Project [Observation#974.id_versioned AS id#832, Observation#974.subject.reference AS patient_id#834, Observation#974.effectiveDateTime AS date#836, @8ehd09#982.code AS code#838, Observation#974.valueQuantity.unit AS unit#840, Observation#974.valueQuantity.value AS value#842]
+- Filter isnotnull(@8ehd09#982)
   +- Generate explode(Observation#974.code.coding), [Observation#974], false, [@8ehd09#982]
      +- Project [struct(id, id#883, id_versioned, id_versioned#884, meta, meta#885, implicitRules, implicitRules#886, language, language#887, text, text#888, identifier, identifier#889, basedOn, basedOn#890, partOf, partOf#891, status, status#892, category, category#893, code, code#894, ... 62 more fields) AS Observation#974]
         +- Filter (isnotnull(valueQuantity#909) AND CASE WHEN CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894.coding, lambdafunction((lambda x_0#1015.system = http://loinc.org), lambda x_0#1015, false)), lambdafunction((lambda x_2#1023.code = 2093-3), lambda x_2#1023, false))) THEN (size(filter(filter(code#894.coding, lambdafunction((lambda x_0#1015.system = http://loinc.org), lambda x_0#1015, false)), lambdafunction((lambda x_2#1023.code = 2093-3), lambda x_2#1023, false)), true) = 0) ELSE true END THEN (valueQuantity#909._code_canonicalized = g.m-3) ELSE false END THEN UDF(CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894.coding, lambdafunction((lambda x_1#1016.system = http://loinc.org), lambda x_1#1016, false)), lambdafunction((lambda x_3#1024.code = 2093-3), lambda x_3#1024, false))) THEN (size(filter(filter(code#894.coding, lambdafunction((lambda x_1#1016.system = http://loinc.org), lambda x_1#1016, false)), lambdafunction((lambda x_3#1024.code = 2093-3), lambda x_3#1024, false)), true) = 0) ELSE true END THEN valueQuantity#909._value_canonicalized END, [2400,0]) ELSE false END)
            +- FileScan parquet [id#883,id_versioned#884,meta#885,implicitRules#886,language#887,text#888,identifier#889,basedOn#890,partOf#891,status#892,category#893,code#894,subject#895,focus#896,encounter#897,effectiveDateTime#898,effectiveInstant#899,effectivePeriod#900,effectiveTiming#901,issued#902,performer#903,valueBoolean#904,valueCodeableConcept#905,valueDateTime#906,... 19 more fields] Batched: false, DataFilters: [isnotnull(valueQuantity#909), CASE WHEN CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894..., Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[file:/Users/gri306/Library/CloudStorage/OneDrive-CSIRO/Data/synth..., PartitionFilters: [], PushedFilters: [IsNotNull(valueQuantity)], ReadSchema: struct<id:string,id_versioned:string,meta:struct<id:string,versionId:string,versionId_versioned:s...

It would be better if we did not have to retrieve all of the columns for this query, only the ones that are required.

johngrimes avatar Jan 11 '24 01:01 johngrimes

I have merged main into issue/1759, and updated the version to 6.5.0-SNAPSHOT.

johngrimes avatar Jan 11 '24 01:01 johngrimes

@piotrszul Would you mind creating a pull request for this change, so that it is easier to follow activity on the branch?

johngrimes avatar Jan 11 '24 01:01 johngrimes