
Document efficient analysis of large datasets

Open Jongmassey opened this issue 2 years ago • 0 comments

Certain study definitions can result in very large input datasets for analytical code (e.g. establishing baseline values for entire populations), especially when repeated cohort extractions are performed (e.g. monthly over a period of multiple years).

Naïve approaches to analysing these large datasets can consume excessive RAM, jeopardising both successful job completion and general platform availability.

We should develop a set of general principles and recommendations for analysing large datasets (a sketch of how these might look in practice follows the list), for example:

  • input minimisation
  • correct datatyping
  • dataframe lifecycles
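
A minimal sketch of how these three principles might look in pandas. The file path, column names, and dtypes are illustrative assumptions, not part of any real study definition:

```python
import pandas as pd

# Hypothetical monthly extract; column names are illustrative only.
INPUT_FILE = "output/input_2021-01-01.csv"

# Input minimisation: read only the columns the analysis actually needs.
# Correct datatyping: low-cardinality strings as category, narrow numeric types,
# dates parsed once at read time.
dtypes = {
    "patient_id": "int32",
    "sex": "category",
    "region": "category",
    "systolic_bp": "float32",
}
df = pd.read_csv(
    INPUT_FILE,
    usecols=list(dtypes) + ["bp_date"],
    dtype=dtypes,
    parse_dates=["bp_date"],
)

# Dataframe lifecycle: derive the summary you need, then release the full extract
# so its memory can be reclaimed before the next monthly file is read.
summary = df.groupby("region", observed=True)["systolic_bp"].mean()
del df
```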

Perhaps we could also provide a tool for crudely estimating dataset size from the number of variables, observations, and their datatypes (a rough sketch follows)?
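
One possible shape for such a tool: a back-of-the-envelope estimate based on per-dtype byte sizes. The byte sizes and the example population figures below are illustrative assumptions rather than measurements, and real pandas memory usage (especially for object/string columns and index overhead) will be higher:

```python
# Rough per-value sizes in bytes for common column datatypes (an assumption;
# actual in-memory cost varies by implementation).
BYTES_PER_DTYPE = {
    "int8": 1, "int32": 4, "int64": 8,
    "float32": 4, "float64": 8,
    "bool": 1, "datetime64[ns]": 8,
    "category": 1,  # approximate: integer codes only, ignoring the category index
}

def estimate_dataframe_bytes(n_rows, column_dtypes):
    """Crude in-memory size estimate: rows x sum of per-column value sizes."""
    return n_rows * sum(BYTES_PER_DTYPE[d] for d in column_dtypes)

# Hypothetical example: 60 monthly extractions for a population of 20 million,
# with four columns of the dtypes shown.
n_rows = 20_000_000 * 60
size = estimate_dataframe_bytes(n_rows, ["int32", "category", "category", "float32"])
print(f"~{size / 1e9:.1f} GB")  # roughly 12 GB before any pandas overhead
```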

Jongmassey · Mar 04 '22 11:03