DataQualityDashboard
Efficient SQL-only option to insert all results into dqdashboard_results table
As it stands, the sqlOnly option has four main limitations:
- It requires connectivity to a live OMOP instance
- It does not include the metadata about the test being run
- It does not insert the results into the dqdashboard_results table
- It is inefficient, running all ~3000 tests serially rather than in parallel
Pull request #301 addresses all of these issues. Users can specify the number of tests to run in parallel; those tests are composed into a single CTE, and their results are then inserted into the output table together. In tests on Spark with 4 years of data and 3 million patients, runtime dropped from 16 hours when running all tests serially to about an hour when running them on a Spark cluster (e.g. Databricks or HDInsight) and inserting all results into dqdashboard_results.
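To illustrate the batching idea described above, here is a minimal sketch in Python. It is not the code from the pull request: the function names `chunk` and `build_batched_inserts`, the CTE name `cte_all`, and the placeholder check queries are all assumptions made for illustration. It only shows how a list of per-check SELECT statements could be grouped into batches, with each batch combined by UNION ALL inside one CTE and written with a single INSERT.

```python
from typing import Iterator, List


def chunk(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batched_inserts(
    check_queries: List[str],
    batch_size: int,
    results_table: str = "dqdashboard_results",
) -> List[str]:
    """Build one INSERT statement per batch of check queries.

    Each check query is assumed (hypothetically) to be a SELECT that
    returns one result row with the same columns as `results_table`.
    """
    statements = []
    for batch in chunk(check_queries, batch_size):
        # Combine the batch into one result set so the database can
        # plan and execute the checks together.
        union_body = "\n  UNION ALL\n".join(f"  {q}" for q in batch)
        statements.append(
            f"WITH cte_all AS (\n{union_body}\n)\n"
            f"INSERT INTO {results_table}\n"
            f"SELECT * FROM cte_all;"
        )
    return statements


if __name__ == "__main__":
    # Placeholder queries standing in for real data quality checks.
    queries = [f"SELECT {i} AS num_violated_rows" for i in range(5)]
    for sql in build_batched_inserts(queries, batch_size=2):
        print(sql, end="\n\n")
```

The placement of the WITH clause relative to INSERT varies between SQL dialects, so the generated statement shape would need adjusting for a given target platform.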