Efficient SQL-only option to insert all results into dqdashboard_results table

Open TMSWhite opened this issue 2 years ago • 0 comments

As it stands, the sqlOnly option has four main limitations:

  • It requires connectivity to a live OMOP instance
  • It does not include the metadata about the test being run
  • It does not insert the results into the dqdashboard_results table
  • It is inefficient, running all ~3000 tests serially rather than in parallel

Pull request #301 solves all of these issues. Users can specify the number of tests to run in parallel; the tests in each batch are composed into a CTE, and the full set of results is then inserted into the output table. In tests on Spark with 4 years of data and 3 million patients, run time dropped from 16 hours when running all tests serially to about an hour on a Spark cluster (e.g. Databricks or HDInsight), including inserting all results into dqdashboard_results.
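To illustrate the batching idea, here is a minimal Python sketch. It is not the code from the pull request (DataQualityDashboard itself is an R package), and everything in it is an assumption made for illustration: the helper names `chunk` and `compose_batch_insert`, the `check_name` and `num_violated_rows` columns, and the shape of the check queries. It groups checks into batches of a user-chosen size, wraps each check in its own CTE, and emits one INSERT per batch.

```python
from typing import Dict, Iterator, List

Check = Dict[str, str]  # hypothetical shape: {"check_name": ..., "sql": ...}


def chunk(checks: List[Check], size: int) -> Iterator[List[Check]]:
    """Yield consecutive batches of at most `size` checks."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(checks), size):
        yield checks[start:start + size]


def compose_batch_insert(checks: List[Check],
                         target_table: str = "dqdashboard_results") -> str:
    """Wrap each check query in a named CTE and insert the union of all results.

    The column list is illustrative only; the real results table has more
    columns. Placement of WITH relative to INSERT varies by SQL dialect.
    """
    ctes = []
    selects = []
    for i, check in enumerate(checks):
        cte_name = f"check_{i}"
        ctes.append(f"{cte_name} AS (\n{check['sql']}\n)")
        # Escape single quotes so the check name is a valid SQL string literal.
        label = check["check_name"].replace("'", "''")
        selects.append(
            f"SELECT '{label}' AS check_name, num_violated_rows FROM {cte_name}"
        )
    return (
        "WITH " + ",\n".join(ctes) + "\n"
        + f"INSERT INTO {target_table} (check_name, num_violated_rows)\n"
        + "\nUNION ALL\n".join(selects)
        + ";"
    )


if __name__ == "__main__":
    # Placeholder queries standing in for real data quality checks.
    demo = [
        {"check_name": f"check_{i}", "sql": f"SELECT {i} AS num_violated_rows"}
        for i in range(5)
    ]
    for batch in chunk(demo, 2):
        print(compose_batch_insert(batch))
        print()
```

With a batch size of 2, the five placeholder checks yield three INSERT statements. Because each statement is independent, a cluster engine can evaluate the checks inside one statement concurrently instead of issuing them one at a time.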

TMSWhite, Jun 10 '22 22:06