pathling
Inconsistent encoding of some polymorphic types
Describe the bug
Unset FHIR elements are sometimes encoded as NULL and sometimes as a complex type/struct with all values set to NULL.
For example, none of the Condition resources have abatementAge set, yet the raw Parquet files sometimes contain a NULL there and sometimes a {"id":null,"value":null,"value_scale":null,"comparator":null,"unit":null,"system":null,"code":null,"_value_canonicalized":{"value":null,"scale":null},"_code_canonicalized":null,"_fid":null}. The distribution is roughly 50:50 for a dataset of around 70,000 resources:
SELECT abatementAge, COUNT(*)
FROM delta.`s3a://pathling-warehouse/default/Condition.parquet` AS Condition
GROUP BY abatementAge;
abatementAge count(1)
NULL 35000
{"id":null,"value":null,"value_scale":null,"comparator":null,"unit":null,"system":null,"code":null,"_value_canonicalized":{"value":null,"scale":null},"_code_canonicalized":null,"_fid":null} 37000
I noticed this because ydata-profiling (https://docs.profiling.ydata.ai/latest/) considers the NULL a missing value but the null-struct a non-null one.
Definitely something I can easily work around; I just thought you might be able to share some insight into why it is happening.
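To illustrate the profiling point, here is a minimal pandas sketch (with hypothetical data, not the actual Pathling output) of why a tool that relies on standard missing-value checks counts the null-struct as a present value:

```python
import pandas as pd

# Two encodings of "abatementAge is unset": a true NULL, and a struct
# whose fields are all null (represented here as a dict of Nones).
df = pd.DataFrame({
    "abatementAge": [
        None,                                       # encoded as NULL
        {"id": None, "value": None, "unit": None},  # encoded as a null-struct
    ]
})

# Only the true NULL is detected as missing; the null-struct is a
# non-null object and therefore counts as a present value.
print(df["abatementAge"].isna().tolist())  # [True, False]
```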
To Reproduce
Import a lot (?) of resources via $import and observe that, seemingly at random, some unset values are encoded as NULL and some as a null-struct.
Expected behavior
I would expect all values to be encoded consistently. That being said, it's not strictly wrong this way.
Thanks for posting this @chgl.
This definitely seems quite strange, and a defect as described. We will work on reproducing this.
Hi @chgl
could you please describe in more detail how you produced the s3a://pathling-warehouse/default/Condition.parquet file? In particular, which version of Pathling and which API were used to create it, and from what input.
I tried to reproduce the issue by importing a large number of Condition resources, but that by itself does not seem to lead to this problem. I did manage to observe the difference between:
{"resourceType":"Condition","id":"xxx", "abatementAge":null}
and
{"resourceType":"Condition","id":"xxx"}
where the latter produces a NULL while the former produces an empty struct.
Could you please check whether your source data uses explicit JSON nulls in some of the resources?
If this is not the case, would you be able to identify a pair of Condition resources (and, I assume, Observations) that have different encodings and send us their (anonymised) JSON representations?
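One way to run that check is a small stdlib sketch that scans each resource for explicit JSON nulls (the NDJSON lines below are hypothetical stand-ins for the source data, not a Pathling utility):

```python
import json

def has_explicit_null(obj):
    """Recursively check whether any value in a parsed JSON document is null."""
    if obj is None:
        return True
    if isinstance(obj, dict):
        return any(has_explicit_null(v) for v in obj.values())
    if isinstance(obj, list):
        return any(has_explicit_null(v) for v in obj)
    return False

# Hypothetical NDJSON lines standing in for the imported source files.
lines = [
    '{"resourceType":"Condition","id":"a","abatementAge":null}',
    '{"resourceType":"Condition","id":"b"}',
]
flagged = [json.loads(l)["id"] for l in lines if has_explicit_null(json.loads(l))]
print(flagged)  # ['a'] — only the resource with an explicit null is flagged
```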
Ah, it happens if a resource is merged, either via $import in merge mode or via update-as-create.
To reproduce:
services:
  minio:
    image: docker.io/bitnami/minio:2024.2.17-debian-12-r2@sha256:4d04a41f9d385d51ecd9be8dafca13fe9d56be2cc1c5ea8f98e6cfb235d87ae5
    environment:
      MINIO_ROOT_USER: "admin"
      # kics-scan ignore-line
      MINIO_ROOT_PASSWORD: "miniopass" # gitleaks:allow
      MINIO_DEFAULT_BUCKETS: "fhir"
    ports:
      - "9000:9000"
      - "127.0.0.1:9001:9001"
  pathling:
    image: docker.io/aehrc/pathling:6.4.2@sha256:9b8ee32d4b8bb40192d6bf25814492a616153a0df15d178c286db9ec80c1c85e
    environment:
      pathling.storage.warehouseUrl: s3://fhir
      pathling.import.allowableSources: s3://fhir/staging
      pathling.terminology.enabled: false
      pathling.terminology.serverUrl: http://localhost:8080/i-dont-exist
      fs.s3a.endpoint: "http://minio:9000"
      fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
      fs.s3a.access.key: "admin"
      fs.s3a.secret.key: "miniopass"
      fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
      fs.s3a.path.style.access: "true"
    ports:
      - "8082:8080"
    depends_on:
      minio:
        condition: service_started
Save this as compose.yaml and run:
docker compose -f compose.yaml up
Using the following bundle (I don't believe the exact contents matter):
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {
      "fullUrl": "Condition/c123",
      "resource": {
        "resourceType": "Condition",
        "id": "c123",
        "code": {
          "coding": [
            {
              "system": "http://fhir.de/CodeSystem/bfarm/icd-10-gm",
              "version": "2021",
              "code": "C72.0"
            }
          ]
        },
        "subject": {
          "reference": "Patient/p123"
        }
      },
      "request": {
        "method": "PUT",
        "url": "Condition/c123"
      }
    }
  ]
}
POST once:
curl -X POST -H "Content-Type: application/fhir+json" --data "@./bundle.json" http://localhost:8082/fhir
I used duckdb to read the Parquet file:
duckdb
SET s3_region='us-east-1';
SET s3_access_key_id='admin';
SET s3_secret_access_key='miniopass';
SET s3_endpoint='localhost:9000';
SET s3_url_style='path';
SET s3_use_ssl=false;
SELECT id,abatementAge
FROM "s3://fhir/default/Condition.parquet/*.parquet";
┌─────────┬─────────────────────────────────────────────────────────────────────┐
│ id │ abatementAge │
│ varchar │ struct(id varchar, "value" decimal(32,6), value_scale integer, co… │
├─────────┼─────────────────────────────────────────────────────────────────────┤
│ c123 │ │
└─────────┴─────────────────────────────────────────────────────────────────────┘
The "empty" row means NULL.
POST again:
curl -X POST -H "Content-Type: application/fhir+json" --data "@./bundle.json" http://localhost:8082/fhir
SELECT id,abatementAge
FROM "s3://fhir/default/Condition.parquet/*.parquet";
┌─────────┬─────────────────────────────────────────────────────────────────────┐
│ id │ abatementAge │
│ varchar │ struct(id varchar, "value" decimal(32,6), value_scale integer, co… │
├─────────┼─────────────────────────────────────────────────────────────────────┤
│ c123 │ │
│ c123 │ {'id': NULL, 'value': NULL, 'value_scale': NULL, 'comparator': NU… │
└─────────┴─────────────────────────────────────────────────────────────────────┘
You can see that the most recent version now uses a struct with all fields set to NULL, while the older version is just NULL.
Hi @chgl!
Good news, we were able to reproduce this problem.
It turns out that it is related to the spark.databricks.delta.schema.autoMerge.enabled configuration setting, which is currently set to true by default.
I believe you may be able to fix this problem simply by setting this to false.
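Assuming the Pathling server passes spark.* properties from its environment through to Spark (the same passthrough pattern the fs.s3a.* settings in the compose file above appear to rely on — treat this as an assumption), the workaround could look like:

```yaml
services:
  pathling:
    environment:
      # Assumption: forwarded to Spark like the fs.s3a.* settings.
      spark.databricks.delta.schema.autoMerge.enabled: "false"
```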
If you are wondering why you are getting multiple rows for the same resource when querying with DuckDB, it is because these are Delta files. Delta is a specialisation of Parquet that includes history. Because you are updating these rows, old versions are being preserved.
These will be invisible to a Delta-aware query engine such as Spark (unless you explicitly query the history).
We're working on a new release that will change the default setting to fix this problem.
Sounds good, thanks for the update @johngrimes !