fhir-data-pipes icon indicating copy to clipboard operation
fhir-data-pipes copied to clipboard

Unable to create Individual columns of nested array of objects in a dataframe

Open suriyan3 opened this issue 2 years ago • 9 comments
trafficstars

I am creating view here

views = r4.base_r4() qrs = views.view_of('QuestionnaireResponse')

QuestionnaireResponse_df = runner.to_dataframe( qrs.select( { "qrs_id": qrs.id, "qrs_tag": qrs.meta.tag.code, "qrs_patient_id":qrs.subject.reference, "qrs_encounter_id": qrs.encounter.reference, "qrs_questionnaire": qrs.questionnaire, "qrs_item":qrs.item } ) ) In this dataframe qrs_item column holds string of array of objects , I am not able to directly convert that to array of objects and assign it to qrs_item or assign individual objects to different columns like the below

runner.to_dataframe( qrs.select( { "qrs_item":qrs.item[0][0] "qrs_item_2.0":"qrs_item[0][1] } )

so I needed an helper function to convert that string to json

Function to convert JSON string to list

def convert_to_list(json_str): try: return json.loads(json_str) except json.JSONDecodeError: return [] QuestionnaireResource_df.zip

Convert the qrs_item column to list

QuestionnaireResponse_df ['qrs_item'] = QuestionnaireResponse_df ['qrs_item'].apply(convert_to_list)

And if I want to run an expression on this qrs_item I need to loop through all the items which is difficult if the item has many nested items. Is there any effective solution to solve this issue.

suriyan3 avatar Jul 14 '23 13:07 suriyan3

I have attached the sample data and queries for reference.

suriyan3 avatar Jul 14 '23 13:07 suriyan3

A few points:

  • Does that make sense to write FHIRPath statements for specific questions/answers that you need from the QR? I mean, instead of extracting all items into a single column, separate items, e.g., based on their linkId using constraints on the FHIRPath.
  • In general, loading most data in all resources of a specific type into a Pandas dataframe is not a great idea as you may hit memory limit issues if the data gets too big. It is probably better to filter-out as much data as possible using the FHIRPath constraints when creating FHIR-views.
  • Have you tried writing direct SQL queries against the Parquet files, similar to the pattern suggested here? That doesn't directly solve the specific problem you mentioned but I am curious if the flattening ideas suggested there suit your needs.

bashir2 avatar Jul 21 '23 06:07 bashir2

  1. Actually this QR table will be created with the necessary columns and then it will be joined with Encounter table. On the final table we will be running expressions to get the values. Can I run FHIRPATH statements directly on the Parquet files? I have tried running FHIRPath statements on parquet files and it was not working.
  2. Yes I have tried writing direct SQL queries against the Parquet files using DBeaver but facing issues to create columns for nested items.

suriyan3 avatar Jul 21 '23 06:07 suriyan3

Re 1: Well a subset of FHIRPath functions are supported, that's the way that we define each "column" in a view. For example, take a look at examples here where the first() function of FHIRPath is used to choose one member of a collection. Are you saying that even simple examples like first() does not work on Parquet+Spark OR is it more advanced FHIRPaths that don't work? If it is the latter, can you provide some examples?

bashir2 avatar Jul 25 '23 02:07 bashir2

Ok, So when I create a dataframe using

QuestionnaireResponse_df = runner.to_dataframe( qrs.select( { "qrs_id": qrs.id, "qrs_tag": qrs.meta.tag.code, "qrs_patient_id":qrs.subject.reference, "qrs_encounter_id": qrs.encounter.reference, "qrs_questionnaire": qrs.questionnaire, "qrs_item":qrs.item } ) )

when I do QuestionnaireResponse_df.iloc[0]['item'] it gives me this nested item as a string value

'[{"id":null,"linkId":"vitals","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"3.1","definition":null,"text":"ANC service provided at *"},{"id":null,"linkId":"service-type-details","definition":null,"text":null},{"id":null,"linkId":"machine-available","definition":null,"text":"Is BP Machine Available? *"},{"id":null,"linkId":"3.2","definition":null,"text":null},{"id":null,"linkId":"3.3","definition":null,"text":null},{"id":null,"linkId":"3.4","definition":null,"text":null},{"id":null,"linkId":"3.5","definition":null,"text":null},{"id":null,"linkId":"3.6","definition":null,"text":"Blood Group *"},{"id":null,"linkId":"3.10","definition":null,"text":null},{"id":null,"linkId":"3.11","definition":null,"text":"Rh factor *"},{"id":null,"linkId":"3.12","definition":null,"text":null}]},{"id":null,"linkId":"counselling","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"4.1","definition":null,"text":"Was counselling done for the following"},{"id":null,"linkId":"4.2","definition":null,"text":null},{"id":null,"linkId":"4.4","definition":null,"text":null},{"id":null,"linkId":"4.6","definition":null,"text":null}]},{"id":null,"linkId":"syphilis","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"5.1.0","definition":null,"text":null},{"id":null,"linkId":"5.3.0","definition":null,"text":null}]},{"id":null,"linkId":"hepatitis-B","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"6.1.0","definition":null,"text":null},{"id":null,"linkId":"6.3.0","definition":null,"text":null}]},{"id":null,"linkId":"hepatitis-C","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"7.1.0","definition":null,"text":null},{"id":null,"linkId":"7.3.0","definition":null,"text":null}]},{"id":null,"linkId":"hiv","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"8.1.0","definition":null,"text":null}]},{"id":null,"linkId":"malaria","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"malaria-1.0","definition":null,"text":null},{"id":null,"linkId":"malaria-3.0","definition":null,"text":null}]},{"id":null,"linkId":"SickleCellTest","definition":null,"text":"Sickle Cell and HbsAg test","answer":null,"item":[{"id":null,"linkId":"9.1.0","definition":null,"text":null},{"id":null,"linkId":"9.3.0","definition":null,"text":null}]},{"id":null,"linkId":"hematology","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"10.1.0","definition":null,"text":null}]},{"id":null,"linkId":"gestational-diabetes","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"11.1","definition":null,"text":null}]},{"id":null,"linkId":"urinalysis","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"12.1.0","definition":null,"text":null}]},{"id":null,"linkId":"Foetal-Examination","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"13.1","definition":null,"text":"Foetal Movement"},{"id":null,"linkId":"13.3","definition":null,"text":null},{"id":null,"linkId":"13.5","definition":null,"text":null},{"id":null,"linkId":"13.6","definition":null,"text":null},{"id":null,"linkId":"13.7","definition":null,"text":"Foetal Position"},{"id":null,"linkId":"13.11","definition":null,"text":null}]},{"id":null,"linkId":"hemanitic-calcium-supplements","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"15.1.0","definition":null,"text":null},{"id":null,"linkId":"15.2.0","definition":null,"text":null}]},{"id":null,"linkId":"llin-and-ipt-test","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"16.1.0","definition":null,"text":null},{"id":null,"linkId":"16.2.0","definition":null,"text":"Was IPT given?"}]},{"id":null,"linkId":"tetanus-diptheria","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"17.1.0","definition":null,"text":"Was Td given?"}]},{"id":null,"linkId":"associated-problems","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"18.1.0","definition":null,"text":"Associated problems"}]},{"id":null,"linkId":"previous-medical-history","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"19.1.0","definition":null,"text":"Previous medical history"}]},{"id":null,"linkId":"outcome-of-visit","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"20.1","definition":null,"text":null}]}]'

while creating the dataframe itself when I change to list or json system restarts or system goes unresponsive QuestionnaireResponse_df = runner.to_dataframe( qrs.select( { "qrs_id": qrs.id, "qrs_tag": qrs.meta.tag.code, "qrs_patient_id":qrs.subject.reference, "qrs_encounter_id": qrs.encounter.reference, "qrs_questionnaire": qrs.questionnaire, "qrs_item":list(qrs.item) // json.loads(qrs.item) } ) )

suriyan3 avatar Jul 25 '23 07:07 suriyan3

Update: what I meant by FHIRPath constraints was, for example, like this:

qr.select(
        {
            "id": qr.id,
            "answer1_0": qr.item.where(qr.item.linkId=='1.0').answer.first().value.ofType('coding').code,
        }
    )

This works and the advantage is that it picks a specific answer and assigns it to a single "column" so you don't need to deal with json/arrays, etc. However, when I tried this with qr.item.item it failed. It seems to be a bug in the fhir-views code. I have informed the team and they are looking into it.

In the meantime doing the same with SQL is probably your best bet. It also avoids loading everything into a dataframe in memory.

bashir2 avatar Jul 26 '23 18:07 bashir2

Update: The mentioned bug with qr.item.item was fixed by this commit (thanks to @rbrush) but then the current Spark runner implementation failed at generating proper SQL for constraints on qr.item.item. This is another issue which we need to look into. However, I like to suggest to reconsider doing direct SQL in the meantime. Here are some examples that hopefully should help:

The basic idea is to EXPLODE repeated columns to get a flat view and then apply required constraints. In your case, because you have answer fields at different levels, (e.g., item.item.answer vs item.item.item.answer), this is a little bit trickier but still doable. First, if you try this query (please pay attention to whether you have a _ in your table name or not):

SELECT QR.id,
  (CASE item_2.linkId WHEN '1.1' THEN item_2.answer[0].value.coding.code ELSE NULL END) AS answer_1_1,
  (CASE item_2.linkId WHEN '3.1' THEN item_2.answer[0].value.coding.code ELSE NULL END) AS answer_3_1,
  (CASE item_3.linkId WHEN '1.2.5' THEN item_3.answer[0].value.coding.code ELSE NULL END) AS answer_1_2_5,
  (CASE item_4.linkId WHEN 'v-value' THEN item_4.answer[0].value.coding.code ELSE NULL END) AS answer_v_value
FROM questionnaire_response AS QR
  LATERAL VIEW OUTER EXPLODE(QR.item) AS item_1
  LATERAL VIEW OUTER EXPLODE(item_1.item) AS item_2
  LATERAL VIEW OUTER EXPLODE(item_2.item) AS item_3
  LATERAL VIEW OUTER EXPLODE(item_3.item) AS item_4

you will see that for each QR.id, you get many rows. In the four answer columns, there are a lot of NULL values but there are also the corresponding answers. So we just need to aggregate these as follows:

SELECT QR.id,
  FIRST((CASE item_2.linkId WHEN '1.1' THEN item_2.answer[0].value.coding.code ELSE NULL END), true) AS answer_1_1,
  FIRST((CASE item_2.linkId WHEN '3.1' THEN item_2.answer[0].value.coding.code ELSE NULL END), true) AS answer_3_1,
  FIRST((CASE item_3.linkId WHEN '1.2.5' THEN item_3.answer[0].value.coding.code ELSE NULL END), true) AS answer_1_2_5,
  FIRST((CASE item_4.linkId WHEN 'v-value' THEN item_4.answer[0].value.coding.code ELSE NULL END), true) AS answer_v_value
FROM questionnaire_response AS QR
  LATERAL VIEW OUTER EXPLODE(QR.item) AS item_1
  LATERAL VIEW OUTER EXPLODE(item_1.item) AS item_2
  LATERAL VIEW OUTER EXPLODE(item_2.item) AS item_3
  LATERAL VIEW OUTER EXPLODE(item_3.item) AS item_4
GROUP BY QR.id;

We are using the FIRST aggregation function, dropping NULL values. I am not sure about the performance of this approach but it at least unblocks you.

bashir2 avatar Jul 31 '23 22:07 bashir2

Just to clarify re performance: A major concern is the GROUP BY part which is not really required to achieve what we are doing here; i.e., all answer elements should be extracted in the context of a single row without requiring a table-wide GROUP BY. I believe that's how fhir-views work but its generated SQL is too complex, hence this simpler approach.

bashir2 avatar Aug 01 '23 01:08 bashir2

Thanks @bashir I have tested this SQL queries it is working fine, we can capture the columns we need and create a flat table. Is it possible to write a query that can loop through all the objects and in each objects there might be nested items and answer so if answer is not null then can we create a column name from the value of text key and its column value be taken from answer array.

suriyan3 avatar Aug 03 '23 07:08 suriyan3

Besides the solution offered above, now with the ViewDefinition support we can create flat-views directly from the original QuestionnaireResponse resources (with FHIRPaths to extract item and answer at arbitrary levels). Closing this issue ...

bashir2 avatar Sep 30 '24 16:09 bashir2