
Document making weekly/monthly queries more efficient

Open · HelenCEBM opened this issue 3 years ago · 1 comment

When to use weekly/monthly query

Where possible, avoid weekly/monthly queries: a single-extract study is usually much more efficient in both processing time and file storage.

Major reasons to use a weekly/monthly query are:

  • key outcomes are "number of events per week/month"
  • patients may have the outcome of interest several times in the study period and you need to know information about each event (other than simply how many events there were).

How to maximise efficiency

The processing/storage burden of these queries depends largely on population size and number of periods, and to some extent the number of variables, so try to minimise these as much as possible.

Basics (applicable to most studies):

  • Limit the included population as much as possible, e.g. exclude children if they are not needed, or exclude people with missing values for key variables.
  • Minimise variables:
    • make sure you've removed any variables you don't need
    • "nest" queries so that, where possible, only the fields that will be used are returned (this works with the functions patients.satisfying() and patients.categorised_as())
    • simplify returning options, e.g. return binary_flag instead of date if dates aren't needed (they typically won't be for weekly/monthly analyses)
  • Use feather format for outputs (unless processing in Stata, where csv.gz can be used instead)
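To illustrate the "nesting" and binary_flag points above, here is a hypothetical study definition fragment using cohort-extractor syntax. The variable names and the codelist are invented for illustration; the intermediate variables inside patients.satisfying() (registered, age, sex) are used to define the population but are not returned as output columns, which is what keeps the extract small.

```python
# Hypothetical fragment — variable names and outcome_codelist are invented.
from cohortextractor import StudyDefinition, patients

study = StudyDefinition(
    index_date="2022-01-01",
    # Limit the population up front: registered adults with a recorded sex.
    # These nested variables are not returned in the output file.
    population=patients.satisfying(
        "registered AND (age >= 18) AND (sex = 'M' OR sex = 'F')",
        registered=patients.registered_as_of("index_date"),
        age=patients.age_as_of("index_date"),
        sex=patients.sex(),
    ),
    # Return binary_flag rather than date: event dates usually aren't
    # needed for weekly/monthly analyses.
    had_outcome=patients.with_these_clinical_events(
        outcome_codelist,  # defined elsewhere via codelist_from_csv (omitted)
        between=["index_date", "index_date + 6 days"],
        returning="binary_flag",
    ),
)
```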

Structure:

  • Group together any variables that can be assessed at a single time point, e.g. ethnicity, and run those just once in a separate study definition; use cohort-joiner (a reusable action) to join them to the weekly/monthly extracts.
  • Options to consider:
    • split the cohort extraction into multiple steps (e.g. one action per one-year period), so that (a) if something goes wrong you might still get some files out, and (b) the steps can run in parallel if there's capacity on the server
    • rather than including the entire population in the "main" study definition with all the variables, limit it to only people within groups of interest (the "numerator") and run a separate study definition for the whole population (the "denominator") with only the minimal variables. This is most helpful if you have lots of variables of interest for the numerator, and the numerator population is relatively small. It can also reduce the number of patients for which the main study definition has to be re-run if, for example, you need to change the way one of the groups is defined: the denominator would stay the same and would not need re-running.
  • Use the measures framework. This will produce a convenient set of aggregated files for further analysis.
    • use measure IDs that allow you to capture only the aggregated measure files, rather than the weekly/monthly files, as outputs in the yaml. Fewer files then need to be stored and imported into the next analysis step, which makes testing locally and running on the server much easier and faster. E.g. end all measure IDs in "rate" so you can capture the outputs with measure*_rate.csv without capturing the monthly measures files (which end with dates, e.g. "measure_1_rate_2022-01-01.csv"). (Link to example).
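You can check the measure-ID naming trick with a quick pattern test (a minimal sketch using Python's fnmatch, which follows the same glob rules; the filenames are the examples from the text):

```python
from fnmatch import fnmatch

# Glob pattern used in the project yaml outputs to capture only
# the aggregated measure files.
pattern = "measure*_rate.csv"

# Aggregated measure file: ends in "_rate.csv", so it is captured.
print(fnmatch("measure_1_rate.csv", pattern))             # True

# Per-month measure file: ends with a date, so it is not captured.
print(fnmatch("measure_1_rate_2022-01-01.csv", pattern))  # False
```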

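A hedged sketch of how the splitting and output-capture ideas above might look in project.yaml — the action names, dates and file paths here are invented, so check the OpenSAFELY documentation for the exact syntax in your setup:

```yaml
actions:
  # One extraction action per one-year period (names and dates invented).
  generate_cohort_2021:
    run: >
      cohortextractor:latest generate_cohort
      --study-definition study_definition
      --index-date-range "2021-01-01 to 2021-12-01 by month"
      --output-dir output/2021
    outputs:
      highly_sensitive:
        cohort: output/2021/input_*.csv.gz

  generate_measures:
    run: >
      cohortextractor:latest generate_measures
      --study-definition study_definition
      --output-dir output
    needs: [generate_cohort_2021]
    outputs:
      moderately_sensitive:
        # Capture only the aggregated files, not the per-month ones.
        measures: output/measure*_rate.csv
```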
Running:

  • Do the first run (and local testing) on a small sample of weeks/months or a subset of the population.
  • Avoid running the study in multiple workspaces as the files will be duplicated for each one (i.e. when making changes, review and merge into the main branch before running again).
  • Let us know if you have old large files on the server that can be deleted.

HelenCEBM · Feb 11 '22 17:02