
Inconsistent encoding of some polymorphic types

Open chgl opened this issue 1 year ago • 5 comments

Describe the bug

Unset FHIR parameters are sometimes encoded as NULL and sometimes as a complex type/struct with all values set to NULL.

For example, none of the Condition resources have abatementAge set, yet the raw Parquet files sometimes contain a NULL there and sometimes a {"id":null,"value":null,"value_scale":null,"comparator":null,"unit":null,"system":null,"code":null,"_value_canonicalized":{"value":null,"scale":null},"_code_canonicalized":null,"_fid":null}. The distribution is roughly 50:50 for a dataset of around 70,000 resources:

SELECT abatementAge, COUNT(*)
FROM delta.`s3a://pathling-warehouse/default/Condition.parquet` AS Condition
GROUP BY abatementAge;
abatementAge	count(1)
NULL	35000
{"id":null,"value":null,"value_scale":null,"comparator":null,"unit":null,"system":null,"code":null,"_value_canonicalized":{"value":null,"scale":null},"_code_canonicalized":null,"_fid":null}	37000

I noticed this because ydata-profiling (https://docs.profiling.ydata.ai/latest/) counts the NULL as a missing value but the null-struct as non-null.

Here's another screenshot:

[screenshot]

It's definitely something I can easily work around; I just thought you might be able to share some insight into why it's happening.
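For what it's worth, the workaround I had in mind is just treating an all-null struct the same as a missing value when post-processing. A minimal sketch in Python (the helper name is mine, not part of any Pathling or ydata-profiling API):

```python
def is_effectively_null(value):
    """Return True if a decoded value is NULL, or a struct whose leaves are all NULL.

    Structs decode to dicts here; nested structs (e.g. _value_canonicalized)
    are checked recursively.
    """
    if value is None:
        return True
    if isinstance(value, dict):
        return all(is_effectively_null(v) for v in value.values())
    return False

# An unset abatementAge can arrive either way:
assert is_effectively_null(None)
assert is_effectively_null({
    "id": None, "value": None,
    "_value_canonicalized": {"value": None, "scale": None},
})
# A struct with any real leaf is not missing:
assert not is_effectively_null({"value": 12, "unit": None})
```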

To Reproduce

Import a lot (?) of resources via $import and observe that, seemingly at random, some unset values are encoded as NULL and some as a null-struct.

Expected behavior

I would expect all values to be encoded consistently. That said, it's not strictly wrong either way.

chgl avatar Apr 02 '24 13:04 chgl

Thanks for posting this @chgl.

This definitely seems quite strange, and a defect as described. We will work on reproducing this.

johngrimes avatar Apr 08 '24 00:04 johngrimes

Hi @chgl

could you please describe in more detail how you produced the s3a://pathling-warehouse/default/Condition.parquet file? In particular, which version of Pathling and which API were used to create it, and from what input?

I tried to reproduce the issue by importing a large number of Condition resources, but that by itself does not seem to lead to this problem. I did manage to observe the difference between:

{"resourceType":"Condition","id":"xxx", "abatementAge":null} and {"resourceType":"Condition","id":"xxx"}

where the latter produces a NULL while the former produces an empty struct.

Could you please check whether your source data uses JSON nulls in some of the resources?
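(For anyone wanting to run this check over their source files: a sketch, assuming the resources can be loaded as JSON one at a time; the helper name is mine.)

```python
import json

def paths_with_json_null(obj, prefix=""):
    """Yield dotted paths of keys that carry an explicit JSON null.

    A key that is absent yields nothing; only `"key": null` is reported,
    which is exactly the distinction between the two encodings above.
    """
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            if value is None:
                yield path
            else:
                yield from paths_with_json_null(value, path)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from paths_with_json_null(item, f"{prefix}[{i}]")

resource = json.loads('{"resourceType":"Condition","id":"xxx","abatementAge":null}')
print(list(paths_with_json_null(resource)))  # ['abatementAge']
```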

If that is not the case, would you be able to identify a pair of Condition resources (and, I assume, Observations) that have different encodings and send us their (anonymised) JSON representations?

piotrszul avatar Apr 11 '24 05:04 piotrszul

Ah, it happens if a resource is merged, either via the $import merge mode or via update-as-create.

To reproduce:

services:
  minio:
    image: docker.io/bitnami/minio:2024.2.17-debian-12-r2@sha256:4d04a41f9d385d51ecd9be8dafca13fe9d56be2cc1c5ea8f98e6cfb235d87ae5
    environment:
      MINIO_ROOT_USER: "admin"
      # kics-scan ignore-line
      MINIO_ROOT_PASSWORD: "miniopass" # gitleaks:allow
      MINIO_DEFAULT_BUCKETS: "fhir"
    ports:
      - "9000:9000"
      - "127.0.0.1:9001:9001"

  pathling:
    image: docker.io/aehrc/pathling:6.4.2@sha256:9b8ee32d4b8bb40192d6bf25814492a616153a0df15d178c286db9ec80c1c85e
    environment:
      pathling.storage.warehouseUrl: s3://fhir
      pathling.import.allowableSources: s3://fhir/staging
      pathling.terminology.enabled: false
      pathling.terminology.serverUrl: http://localhost:8080/i-dont-exist
      fs.s3a.endpoint: "http://minio:9000"
      fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
      fs.s3a.access.key: "admin"
      fs.s3a.secret.key: "miniopass"
      fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
      fs.s3a.path.style.access: "true"
    ports:
      - "8082:8080"
    depends_on:
      minio:
        condition: service_started

Save as compose.yaml and run

docker compose -f compose.yaml up

Using the following bundle (I don't believe the exact contents matter):

{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {
      "fullUrl": "Condition/c123",
      "resource": {
        "resourceType": "Condition",
        "id": "c123",
        "code": {
          "coding": [
            {
              "system": "http://fhir.de/CodeSystem/bfarm/icd-10-gm",
              "version": "2021",
              "code": "C72.0"
            }
          ]
        },
        "subject": {
          "reference": "Patient/p123"
        }
      },
      "request": {
        "method": "PUT",
        "url": "Condition/c123"
      }
    }
  ]
}

POST once:

curl -X POST -H "Content-Type: application/fhir+json" --data "@./bundle.json" http://localhost:8082/fhir

I used DuckDB to read the Parquet files:

duckdb
SET s3_region='us-east-1';
SET s3_access_key_id='admin';
SET s3_secret_access_key='miniopass';
SET s3_endpoint='localhost:9000';
SET s3_url_style='path';
SET s3_use_ssl=false;

SELECT id,abatementAge
  FROM "s3://fhir/default/Condition.parquet/*.parquet";
┌─────────┬─────────────────────────────────────────────────────────────────────┐
│   id    │                            abatementAge                             │
│ varchar │ struct(id varchar, "value" decimal(32,6), value_scale integer, co…  │
├─────────┼─────────────────────────────────────────────────────────────────────┤
│ c123    │                                                                     │
└─────────┴─────────────────────────────────────────────────────────────────────┘

The "empty" cell means NULL.

POST again:

curl -X POST -H "Content-Type: application/fhir+json" --data "@./bundle.json" http://localhost:8082/fhir
SELECT id,abatementAge
  FROM "s3://fhir/default/Condition.parquet/*.parquet";
┌─────────┬─────────────────────────────────────────────────────────────────────┐
│   id    │                            abatementAge                             │
│ varchar │ struct(id varchar, "value" decimal(32,6), value_scale integer, co…  │
├─────────┼─────────────────────────────────────────────────────────────────────┤
│ c123    │                                                                     │
│ c123    │ {'id': NULL, 'value': NULL, 'value_scale': NULL, 'comparator': NU…  │
└─────────┴─────────────────────────────────────────────────────────────────────┘

You can see that the most recent version now uses a struct with NULLs, while the older version is just NULL.

chgl avatar Apr 11 '24 09:04 chgl

Hi @chgl!

Good news, we were able to reproduce this problem.

It turns out that it is related to the spark.databricks.delta.schema.autoMerge.enabled configuration setting, which is currently set to true by default.

I believe you may be able to fix this problem simply by setting this to false.
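(An untested sketch of what that might look like, extending the compose file above: the `pathling` container already receives `pathling.*` and `fs.s3a.*` keys through its environment, so I am assuming `spark.*` keys can be passed the same way.)

```yaml
  pathling:
    environment:
      # Assumed workaround: disable Delta schema auto-merge so unset
      # polymorphic fields stay encoded as plain NULLs.
      spark.databricks.delta.schema.autoMerge.enabled: "false"
```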

If you are wondering why you are getting multiple rows for the same resource when querying with DuckDB, it is because these are Delta files. Delta is a table format layered on top of Parquet that retains history. Because you are updating these rows, the old versions are preserved as well.

These will be invisible to a Delta-aware query engine such as Spark (unless you explicitly query the history).

We're working up a new release which will change the default setting to fix this problem.

johngrimes avatar Apr 17 '24 01:04 johngrimes

Sounds good, thanks for the update @johngrimes !

chgl avatar Apr 19 '24 09:04 chgl