Document efficient analysis of large datasets
Certain study definitions can produce very large input datasets for analytical code (e.g. establishing baseline values for entire populations), especially when cohort extractions are repeated (e.g. monthly over several years).
Naïve approaches to analysing these large datasets can use excessive RAM, causing jobs to fail and degrading platform availability for other users.
We should develop a set of general principles and recommendations for analysing large datasets, e.g.:
- input minimisation (read only the columns and rows the analysis actually needs)
- correct datatyping (e.g. categoricals rather than strings, smaller numeric types)
- dataframe lifecycles (release large intermediate dataframes once they are no longer needed)
A rough sketch of these principles follows the list.
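As a starting point, something like the following pandas sketch could illustrate all three principles together. The file path, column names, and dtypes here are purely hypothetical placeholders, not part of any existing study definition.

```python
import pandas as pd

# Hypothetical extract file and columns, for illustration only.
INPUT_CSV = "output/input.csv"

# Input minimisation + correct datatyping: read only the columns the
# analysis needs, and assign memory-efficient dtypes up front rather
# than letting pandas default to 64-bit numbers and object strings.
dtypes = {
    "patient_id": "int32",
    "region": "category",
    "age": "Int8",          # nullable integer, tolerates missing values
    "systolic_bp": "float32",
}

df = pd.read_csv(INPUT_CSV, usecols=list(dtypes), dtype=dtypes)

# Dataframe lifecycle: derive what is needed, then drop the large
# intermediate so its memory can be reclaimed before the next step.
summary = df.groupby("region", observed=True)["systolic_bp"].mean()
del df

summary.to_csv("output/systolic_bp_by_region.csv")
```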
We could perhaps also provide a tool for crude estimation of dataset size, based on the number of variables, observations, and their datatypes.
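A minimal sketch of what such an estimator might look like is below; the per-value byte counts are rough rules of thumb rather than measured figures, and the example column set is invented for illustration.

```python
# Approximate in-memory bytes per value for common pandas dtypes.
BYTES_PER_VALUE = {
    "int64": 8,
    "float64": 8,
    "int32": 4,
    "float32": 4,
    "category": 1,     # roughly 1 byte per row plus a small shared lookup table
    "object_str": 60,  # Python string objects cost far more than the text itself
}

def estimate_mb(n_rows: int, columns: dict[str, str]) -> float:
    """Crudely estimate the in-memory size (MB) of a dataframe with the
    given column-name -> dtype mapping."""
    total_bytes = sum(n_rows * BYTES_PER_VALUE[dtype] for dtype in columns.values())
    return total_bytes / 1e6

# Example: 5 million patients extracted monthly over 3 years.
cols = {"patient_id": "int64", "region": "category",
        "age": "int32", "systolic_bp": "float64"}
print(f"Estimated size: {estimate_mb(5_000_000 * 36, cols):,.0f} MB")
```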