mimic-omop

Make ETL run in Google BigQuery

Open vojtechhuser opened this issue 4 years ago • 7 comments

The current code is specific to the PostgreSQL dialect (e.g., ::integer casts appear in the code).

To run on other platforms, notes on how to port it are needed.
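
As an illustration of the kind of porting note meant here, the following is a minimal sketch, not taken from the repository, of rewriting PostgreSQL `::type` casts into the `CAST(... AS ...)` form that BigQuery accepts. The function name `rewrite_casts` and the type mapping are assumptions made for this example, and the pattern only handles casts applied to plain (optionally table-qualified) column names, not to arbitrary expressions or literals.

```python
import re

# Hypothetical, deliberately small mapping from PostgreSQL type names
# to BigQuery Standard SQL type names (an assumption for this sketch).
PG_TO_BQ_TYPES = {
    "integer": "INT64",
    "bigint": "INT64",
    "text": "STRING",
    "date": "DATE",
    "double precision": "FLOAT64",
}

# Matches a simple identifier (optionally table-qualified) followed by ::type.
CAST_PATTERN = re.compile(
    r"([A-Za-z_][\w.]*)::(double precision|[A-Za-z_]\w*)",
    re.IGNORECASE,
)


def rewrite_casts(sql: str) -> str:
    """Replace `column::type` with `CAST(column AS TYPE)`."""

    def replace(match: re.Match) -> str:
        expression = match.group(1)
        pg_type = match.group(2).lower()
        # Unknown types are passed through in upper case rather than guessed.
        bq_type = PG_TO_BQ_TYPES.get(pg_type, pg_type.upper())
        return f"CAST({expression} AS {bq_type})"

    return CAST_PATTERN.sub(replace, sql)


print(rewrite_casts("SELECT subject_id::integer FROM patients"))
```

A regular expression like this can only cover the simple cases; casts on parenthesised expressions or string literals would still need manual review.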

vojtechhuser avatar May 05 '20 19:05 vojtechhuser

The most straightforward way to do this might be through the BigQuery DBAPI: https://googleapis.dev/python/bigquery/latest/dbapi.html since it will likely allow for significant code reuse without having to write a lot of BigQuery "Standard SQL". However, I don't know the details of the ETL or the DBAPI well enough to know how much additional work is required to make this happen. It may be just as much work to write the ETL from scratch to accommodate BigQuery.
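
A rough sketch of what reuse through the DBAPI could look like, under the assumption that the ETL can be expressed as an ordered list of SQL statements: the runner below is written against the generic Python DB-API (PEP 249), so the same function could be handed a PostgreSQL or a BigQuery connection. The name `run_statements` is hypothetical, and an in-memory SQLite database stands in for the real backend so the example runs without credentials. Note that the DB-API only standardises how statements are submitted; it does not translate SQL dialects, so the statements themselves would still need to be valid BigQuery SQL.

```python
import sqlite3
from typing import Iterable


def run_statements(connection, statements: Iterable[str]) -> int:
    """Execute statements in order on any PEP 249 connection.

    Returns the number of statements executed.
    """
    cursor = connection.cursor()
    executed = 0
    try:
        for statement in statements:
            cursor.execute(statement)
            executed += 1
        connection.commit()
    finally:
        cursor.close()
    return executed


# Local stand-in so the sketch runs without cloud credentials.
demo_connection = sqlite3.connect(":memory:")
demo_count = run_statements(
    demo_connection,
    [
        "CREATE TABLE person (person_id INTEGER)",
        "INSERT INTO person VALUES (1)",
    ],
)

# With BigQuery, the same function would receive a DB-API connection instead
# (assumes google-cloud-bigquery is installed and credentials are configured;
# `bigquery_statements` is a hypothetical list of BigQuery SQL strings):
#   from google.cloud import bigquery
#   from google.cloud.bigquery import dbapi
#   bq_connection = dbapi.connect(bigquery.Client())
#   run_statements(bq_connection, bigquery_statements)
```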

spfohl avatar May 05 '20 20:05 spfohl

The Stanford team has done some work with this ETL and GBQ. Stay tuned for more details.

vojtechhuser avatar May 20 '20 16:05 vojtechhuser

I'm going to add myself to that list of Stanford collaborators. I have worked on the ETL in the past, have the converted MIMIC-OMOP data in BigQuery, and have been working on ETL+modeling tools.

spfohl avatar May 20 '20 16:05 spfohl

Thanks @spfohl, we're looking to coordinate a single mapping, preferably within this repository.

The plan is then for the contributors to submit a well-described version of the output dataset to PhysioNet, to (1) allow the specific version used in a study to be clearly cited and (2) avoid users having to build the OMOP version themselves.

We are in the process of identifying a technical lead for the work who can take responsibility for managing the development process (i.e. overseeing development work, code review, testing framework, etc). It would be good to chat if you have thoughts on this!

tompollard avatar May 20 '20 16:05 tompollard

I'll sync with the others working with the BigQuery pipeline and see where I can best contribute and then follow up. I don't currently have the bandwidth to take on a leadership role here, but I am highly interested and motivated in broadly improving the quality and usability of this ETL, since I am involved in several ongoing research efforts that would benefit from that.

spfohl avatar May 20 '20 17:05 spfohl

Sounds good, thanks @spfohl. We'll post updates as things develop.

We've found it difficult to decide how best to manage multiple SQL dialects for other projects. If whoever becomes lead decides that BigQuery syntax is best, then maybe we just port this whole repo.

It's an interesting thought that there may be multiple MIMIC to OMOP mappings already out there and being used. If so, it would be an interesting study to explore how the choice of mapping contributes to the output of an analysis.

tompollard avatar May 20 '20 17:05 tompollard

I worked with @jdposada and @PriyaDesai70 on a proof of concept to see how long it would take us to convert one SQL script to BigQuery syntax. It took about 90 minutes to convert the procedure_occurence script, but I imagine that further tables would be much faster. There are just a few simple patterns that need to be substituted, and some, like SELECT DISTINCT ON, that required some more complicated logic. See here
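
To illustrate why SELECT DISTINCT ON needs more than a textual substitution: BigQuery has no DISTINCT ON, and one common replacement (not necessarily the one used in the proof of concept) is to number the rows within each group using ROW_NUMBER() and keep only the first. The helper below is a hypothetical sketch that builds such a query; the table and column names in the example are placeholders, not taken from the ETL scripts.

```python
from typing import Sequence


def distinct_on_to_row_number(
    table: str,
    distinct_cols: Sequence[str],
    order_by: Sequence[str],
) -> str:
    """Build a BigQuery query that keeps the first row per group.

    `distinct_cols` are the columns from DISTINCT ON (...); `order_by`
    lists the remaining ORDER BY columns that decide which row wins.
    """
    partition = ", ".join(distinct_cols)
    ordering = ", ".join(order_by)
    return (
        "SELECT * EXCEPT (row_num)\n"
        "FROM (\n"
        "  SELECT *,\n"
        f"    ROW_NUMBER() OVER (PARTITION BY {partition} ORDER BY {ordering}) AS row_num\n"
        f"  FROM {table}\n"
        ")\n"
        "WHERE row_num = 1"
    )


# PostgreSQL original (hypothetical table and columns):
#   SELECT DISTINCT ON (hadm_id) * FROM procedures ORDER BY hadm_id, starttime
example_query = distinct_on_to_row_number("procedures", ["hadm_id"], ["starttime"])
print(example_query)
```

The `* EXCEPT (row_num)` clause drops the helper column again, so the result has the same columns as the PostgreSQL query.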

spfohl avatar May 28 '20 22:05 spfohl