DataQualityDashboard
DataQualityDashboard copied to clipboard
Fix sqlOnly
(1) When using sqlOnly, an actual database connection is no longer required. (2) The generated SQL inserts each query result into the dqdashboard_results table with all associated metadata, and computes threshold violations as part of each query (rather than waiting until all query results have been run).
Note: this presumes that the user has pre-created the dqdashboard_results table prior to running the SQL. The output has been tested and validated on Hive and on the HDInsight flavor of Spark SQL.
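To make the behavior described above concrete, here is a rough sketch of the shape of SQL this mode generates. All column names, the check name, and the threshold are illustrative placeholders, not the actual dqdashboard_results schema:

```sql
-- Illustrative sketch only: the real generated SQL and the
-- dqdashboard_results columns may differ. Each check query computes its
-- violation counts and evaluates its threshold inline, then inserts the
-- result row directly into the pre-created results table.
INSERT INTO @results_database_schema.dqdashboard_results
SELECT
  'isRequired'           AS check_name,     -- placeholder check name
  'PERSON.year_of_birth' AS checked_field,  -- placeholder field
  cnt.num_violated_rows,
  cnt.num_rows,
  CASE WHEN 100.0 * cnt.num_violated_rows / cnt.num_rows > 0  -- threshold
       THEN 1 ELSE 0 END AS failed
FROM (
  SELECT COUNT(*) AS num_rows,
         SUM(CASE WHEN year_of_birth IS NULL THEN 1 ELSE 0 END) AS num_violated_rows
  FROM @cdm_database_schema.person
) cnt;
```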
Thanks so much, this is great! I will review and send you any feedback.
Any additional feedback on this? I have now validated that it also works on Databricks in addition to HDInsight. However, since it only runs one query at a time, it doesn't take full advantage of a Spark environment. It would be better if there were a parameter to specify how many concurrent sessions to expect (such as 20) and then have it generate a union of those queries before doing the insert. Do you know if anyone is working on optimizing DQD for such parallel computation?
@clairblacketer , I have further refined this. There is now a @sqlOnlyUnionCount parameter. When it is set to a value > 1, that many of the SQL queries are unioned into a CTE before inserting the results into the output table. This allows for faster overall processing, as Spark clusters and other optimizers can run a large number of queries in parallel, even when running from pre-generated SQL scripts.
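A hedged sketch of what the unioned form might look like (the check names, predicates, and output columns are illustrative; the actual generated SQL may differ):

```sql
-- Illustrative sketch only. With sqlOnlyUnionCount = N (> 1), up to N check
-- queries are unioned inside a CTE and inserted with a single statement, so
-- a Spark cluster can evaluate the member queries in parallel.
INSERT INTO @results_database_schema.dqdashboard_results
WITH unioned_checks AS (
  SELECT 'check_a' AS check_name, COUNT(*) AS num_violated_rows
  FROM @cdm_database_schema.person
  WHERE year_of_birth IS NULL
  UNION ALL
  SELECT 'check_b' AS check_name, COUNT(*) AS num_violated_rows
  FROM @cdm_database_schema.observation_period
  WHERE person_id IS NULL
  -- further queries unioned here, up to sqlOnlyUnionCount in total
)
SELECT check_name, num_violated_rows FROM unioned_checks;
```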
Here is sample code to run this:
```r
library(DataQualityDashboard)

# createConnectionDetails comes from DatabaseConnector
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "spark",
  server = "[OHDSI_SERVER_HOST]"
)

DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "@cdm_database_schema",
  resultsDatabaseSchema = "@results_database_schema",
  vocabDatabaseSchema = "@vocabulary_database_schema",
  cdmSourceName = "[data_source]",
  numThreads = 1,
  sqlOnly = TRUE,
  sqlOnlyUnionCount = 20,
  outputFolder = "C:\\tmp\\dqd20",
  outputFile = "dqd.sql",
  verboseMode = TRUE,
  writeToTable = TRUE,
  checkLevels = c("TABLE", "FIELD", "CONCEPT"),
  tablesToExclude = c(),
  checkNames = c()
)
```
Merged with the new develop branch. Refactored the new sqlOnly logic and moved it into separate functions in R/sqlOnly.R. Some final testing and comparison to a normal run are still needed.
A few differences I already noticed:
- The new check statuses (NA, error) cannot be evaluated (checks not run yet)
- No individual query performance results
- If one check fails, all checks in the query fail (e.g. all cdmField checks fail if OFFSET fails)
- Running sqlOnly, with union of multiple queries, seems to be much faster (at least 10x in my tests)
Todo:
- [x] Compare results to normal run
- [x] Add example script to run in sqlOnly mode
While doing final testing, I ran into an issue with SqlRender where the rendered 'plausibleValueLow' and 'plausibleValueHigh' queries become malformed. Issue reported here: https://github.com/OHDSI/SqlRender/issues/304. For now, the CONCEPT level checks will not work.
I did compare a normal and sqlOnly run, and the same number of checks fail. The only difference is that in the normal run, there are checks with NA status.
The code can be tested with extras/codeToRun_sqlOnly.R
The issue with rendering the 'plausibleValueLow' and 'plausibleValueHigh' checks will be fixed in SqlRender v1.10.1.

Final comments, with tests also executing the CONCEPT level checks:
- The check results are identical (see below), except for the 'OFFSET' error that I fixed manually in the sqlOnly check
- On my test data (1k), there was no difference in execution time. Both took about 1.5 min, where for sqlOnly this was 1 min to generate the check SQL and 30 sec to execute and process the check results.


TODO:
- [x] Pending SqlRender v1.10.1 release
- [ ] Resolve conflicts with develop
- [ ] Add vignette on using sqlOnly (based on extras/codeToRun_sqlOnly.R)
@MaximMoinat or @clairblacketer , any update on whether this capability (especially using sqlOnlyUnionCount to generate CTEs that run multiple queries simultaneously) can be incorporated into Version 2?
Hi Tom, thanks for checking in on this. I would also love to see this merged.
Would you be able to make a start on implementing the sqlOnlyUnionCount? In parallel I can start solving the merge conflicts.
After that we need to complete the following steps:
- Add documentation, ideally in the form of a vignette.
- Final testing
- Review and merge into dev by package maintainer
@MaximMoinat , I checked out the latest code from the fix_sqlOnly branch and confirmed that sqlOnlyUnionCount is working correctly when I call it from the new codeToRun_sqlOnly.R script. Performance is excellent on Databricks. Using sqlOnlyUnionCount = 100, I was able to compute all 3515 tests in 33 minutes, despite some tables having more than 5 billion rows.
How else can I help?
Note: I spent a couple of hours floundering while trying to resolve the merge conflicts, so I'm hoping you can lead that part of the effort.
@MaximMoinat , nevermind about help with merging. I got the merge to succeed, and am now validating.
@MaximMoinat and @clairblacketer , I submitted a different pull request from the fork that I'm actively using: pull request #443. I was able to successfully merge all of Maxim's changes and confirmed that it runs correctly on my Databricks Spark instance. You can close this pull request (#301).
Replaced this with pull request #443.