Achilles
Optimized queries 717, 815, and 1815; reduces run-time on Spark from hours to about 30 seconds
Fixes issue #717
@TomWhite-MedStar , I checked in with our techs and they reminded me that we took some steps to reduce the usage of CTEs because of the challenges of their performance/optimization on certain platforms (e.g., Redshift CTEs seem to be replicated across the cluster, ignoring any table partitioning on the tables contained within).
Can you refactor this into 3 temp tables:
- agg
- ordbyval
- cutpoints
SqlRender has 'hints' coded so that we can control the dist key on tables. In this case, I think we want to dist on the measurement_concept_id, drug_concept_id, etc.
It's done like this:
```sql
--HINT DISTRIBUTE_ON_KEY(measurement_concept_id)
SELECT o.measurement_concept_id
  , o.unit_concept_id
  , COUNT_BIG(*) AS num_recs
  , MIN(o.value_as_number) AS min_value
  , MAX(o.value_as_number) AS max_value
  , CAST(AVG(1.0 * o.value_as_number) AS FLOAT) AS avg_value
  , CAST(STDDEV(o.value_as_number) AS FLOAT) AS stdev_value
INTO #agg_1815
FROM ....
....
;

-- other temp tables

-- final temp insert:
-- Select final values for inclusion INTO achilles_results_dist
--HINT DISTRIBUTE_ON_RANDOM
SELECT 1815 AS analysis_id
  , a.measurement_concept_id AS stratum_1
  , a.unit_concept_id AS stratum_2
  , a.num_recs AS count_value
  , a.min_value
  , a.max_value
  , a.avg_value
  , a.stdev_value
  , CASE WHEN c.median_value IS NULL THEN a.max_value ELSE c.median_value END AS median_value
  , CASE WHEN c.p10_value IS NULL THEN a.max_value ELSE c.p10_value END AS p10_value
  , CASE WHEN c.p25_value IS NULL THEN a.max_value ELSE c.p25_value END AS p25_value
  , CASE WHEN c.p75_value IS NULL THEN a.max_value ELSE c.p75_value END AS p75_value
  , CASE WHEN c.p90_value IS NULL THEN a.max_value ELSE c.p90_value END AS p90_value
INTO #tempResults_1815
FROM #agg_1815 a
JOIN #calc_1815 c
  ON a.measurement_concept_id = c.measurement_concept_id
  AND a.unit_concept_id = c.unit_concept_id
ORDER BY a.measurement_concept_id
  , a.unit_concept_id
;

-- Drop the temp tables agg, calc, cutpoints, etc.
```
I suggest putting the analysis_id as a suffix on the temp tables you're extracting from the CTEs, because if another analysis query creates the same tables (i.e., #agg for the measurement, drug, etc. temp tables), you could get a parse error saying that a table has been declared twice, even though there is a drop table #agg between them.
The `--HINT DISTRIBUTE_ON_KEY(measurement_concept_id)` will instruct SqlRender to craft the CTAS to distribute on the key. This is important because you don't want to replicate the measurement concepts across the cluster; you want to keep those concepts together because you join on them later.
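The intuition behind distributing on the join key can be sketched outside SQL: rows from two tables that share a key hash to the same node, so the later join needs no cross-node data movement. This is a conceptual illustration only; the node count and sample rows are invented, and a real engine applies its own hash function:

```python
# Conceptual sketch of key-based distribution: rows that share a join
# key hash to the same node, so joining them later is node-local.
# NUM_NODES and the sample rows are invented for illustration.
NUM_NODES = 4

def node_for(key: int) -> int:
    # A real cluster uses its own internal hash; a simple modulus
    # is enough to show the idea.
    return hash(key) % NUM_NODES

agg_rows  = [(3004249, "mmHg"), (3013682, "mg/dL")]   # (measurement_concept_id, unit)
calc_rows = [(3004249, 120.0),  (3013682, 95.0)]      # (measurement_concept_id, median)

# Because both tables are distributed on measurement_concept_id,
# each matching pair of rows lands on the same node.
for (a_key, _), (c_key, _) in zip(agg_rows, calc_rows):
    assert node_for(a_key) == node_for(c_key)
```

Without a shared distribution key, the engine must either replicate one table to every node or shuffle both at join time, which is the cost the hint avoids.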
Update: I added a `--HINT DISTRIBUTE_ON_RANDOM` on the last select..into to avoid another table replication. In the final aggregation, we can distribute the rows randomly because we're not joining back to that table later. We could distribute on the stratum_1 column if we wanted, but I don't think it matters in this case. You can try it, but it only matters on Spark if Spark handles these hints (here is an example for redshift).
Let me know if this makes sense and it's something you can refactor.
-Chris
@TomWhite-MedStar Thanks for your PR. @chrisknoll is correct with his recommendations regarding our limited use of CTEs. Could you provide the hardware specifications of your Spark environment? Do you have any performance benchmarks on how these changes would impact other database platforms?
@fdefalco and @chrisknoll , I refactored the queries as you instructed, and performance is still good on Databricks.
For our Spark environment, we are using a Databricks "Small Warehouse". Our measurement and observation tables have about 5.5B rows each, but 815 and 1815 both fully process in less than a minute.
Ok, thanks. The Achilles 1.7.1 release is currently pending CRAN approval. Achilles is also being added to HADES, but it has not yet achieved compliance with HADES standards. This might delay our progress in incorporating this PR. In the meantime we will be able to test these optimizations across other platforms and then incorporate them into a 1.7.2 release if everything checks out.
@fdefalco , congratulations on the 1.7.2 release. Please let me know if you need any more information about my two PRs.
Hi @fdefalco, Is there any timeline for when the PR will be reviewed? I can review and merge if you'd like.
Current priority is having the CommonDataModel package added to CRAN. That is in progress, with submissions to CRAN on Friday and again on Saturday after resolving some issues. Once the CommonDataModel package is accepted we'll be able to push the Eunomia changes and have it pushed to CRAN as well.
My current issue with this PR is that we haven't addressed the consistency question (does the updated query produce results consistent with the original query?) or the performance implications on platforms other than Databricks. I now have access to a Databricks environment, so I should be able to test those last two items myself. Priority still resides with CommonDataModel and Eunomia before another release of Achilles.
Thanks for the update!
It makes sense to ensure consistency (no change in output on all supported DBMSs) and performance on DBMSs other than Spark. I think this could be done with unit tests that run in the GitHub CI process. That way we always have this information for any PR on Achilles.
I see there are tests that run with each push on several DBMSs, but I'm not sure they actually run when someone opens a PR from an external branch, since the GitHub secrets are not available there.
We can use snapshot testing to make sure that the results of running Achilles (CSV files) are unchanged.
@fdefalco @TomWhite-MedStar , I was checking the changes on Redshift and found that the results of the updated queries differ from what we see using the version from the develop branch. Row counts are the same, but some values don't match.
Below are a few examples. Lines marked "old" were generated using the develop branch, "new" using TomWhite-MedStar:bugfix_717. "Old" and "new" lines appear one after another, so it's easy to see the differences.
```
rn | kind | analysis_id | count_value | min_value | max_value | avg_value | stdev_value | median_value | p10_value | p25_value | p75_value | p90_value
---+------+-------------+-------------+-----------+-----------+-----------+-------------+--------------+-----------+-----------+-----------+----------
 1 | old  |         717 |           2 |         4 |         6 |         5 |        1.41 |            4 |         4 |         4 |         6 |         6
 1 | new  |         717 |           2 |         4 |         6 |         5 |        1.41 |            4 |         4 |         4 |         4 |         4
 2 | old  |         717 |           5 |        12 |       210 |      72.4 |       79.11 |           50 |        12 |        30 |        60 |       210
 2 | new  |         717 |           5 |        12 |       210 |      72.4 |       79.11 |           30 |        12 |        12 |        50 |        60
 3 | old  |         717 |           3 |         1 |         3 |         2 |           1 |            2 |         1 |         1 |         3 |         3
 3 | new  |         717 |           3 |         1 |         3 |         2 |           1 |            1 |         1 |         1 |         2 |         2

rn | kind | analysis_id | count_value | min_value | max_value | avg_value | stdev_value | median_value | p10_value | p25_value | p75_value | p90_value
---+------+-------------+-------------+-----------+-----------+-----------+-------------+--------------+-----------+-----------+-----------+----------
 1 | old  |         815 |           2 |         0 |       730 |       365 |      516.19 |            0 |         0 |         0 |       730 |       730
 1 | new  |         815 |           2 |         0 |       730 |       365 |      516.19 |            0 |         0 |         0 |         0 |         0
 2 | old  |         815 |           4 |         0 |      1240 |       730 |      521.92 |          840 |         0 |         0 |       840 |      1240
 2 | new  |         815 |           4 |         0 |      1240 |       730 |      521.92 |          840 |         0 |         0 |       840 |       840
 3 | old  |         815 |          23 |     19.69 |     40.89 |     31.61 |        6.53 |         32.2 |     22.89 |      27.3 |      38.1 |     40.79
 3 | new  |         815 |          23 |     19.69 |     40.89 |      31.6 |        6.53 |         30.5 |     21.89 |        25 |      37.1 |      38.7

rn | kind | analysis_id | count_value | min_value | max_value | avg_value | stdev_value | median_value | p10_value | p25_value | p75_value | p90_value
---+------+-------------+-------------+-----------+-----------+-----------+-------------+--------------+-----------+-----------+-----------+----------
 1 | old  |        1815 |           5 |         0 |     154.5 |     31.12 |       68.97 |            0 |         0 |         0 |       1.1 |     154.5
 1 | new  |        1815 |           5 |         0 |     154.5 |     31.12 |       68.97 |            0 |         0 |         0 |         0 |       1.1
 2 | old  |        1815 |          11 |       1.8 |     20.69 |      9.08 |        6.19 |         8.19 |       2.1 |       5.4 |      12.1 |      19.5
 2 | new  |        1815 |          11 |       1.8 |     20.69 |      9.07 |        6.19 |         7.29 |       1.8 |       2.1 |      8.69 |      12.1
 3 | old  |        1815 |           6 |      13.9 |      50.7 |     28.91 |       13.92 |        27.19 |      13.9 |     14.19 |     34.89 |      50.7
 3 | new  |        1815 |           6 |      13.9 |      50.7 |     28.91 |       13.92 |        27.19 |      13.9 |      13.9 |      32.6 |     34.89
```
It looks as though only the median and percentile values differ. I'll double-check my logic. For the median, I suspect I didn't apply conditional logic based on an odd vs. even number of values.
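To illustrate how odd-versus-even handling alone can shift a median, here is a generic sketch of two common conventions. This is not the actual Achilles SQL, just an illustration of the kind of off-by-one the comment above suspects:

```python
# Generic sketch of two median conventions (not the actual Achilles SQL).
# With an even row count, picking the lower middle element differs from
# averaging the two middle elements, which can produce small mismatches
# even when counts, min, max, and averages all agree.
def median_lower(values):
    """Lower-middle median: the element at index (n - 1) // 2."""
    s = sorted(values)
    return s[(len(s) - 1) // 2]

def median_averaged(values):
    """Textbook median: average the two middle elements when n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

evens = [4, 6]                  # like the count_value = 2 row for 717 above
assert median_lower(evens) == 4
assert median_averaged(evens) == 5.0

odds = [1, 2, 3]                # with an odd count, both conventions agree
assert median_lower(odds) == median_averaged(odds) == 2
```

The same boundary choice applies to every percentile cutpoint, which would explain why only the median and p10/p25/p75/p90 columns differ while counts and averages match.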
I believe these have been addressed by PRs #761 and #756.