Document efficient analysis of large datasets
Certain study definitions can produce very large input datasets for analytical code (e.g. establishing baseline values for entire populations), especially when cohort extractions are repeated (e.g. monthly over several years).
Naïve approaches to analysing these large datasets can use excessive RAM, causing jobs to fail and degrading platform availability for other users.
We should develop a set of general principles and recommendations for analysing large datasets, e.g.:
- input minimisation (read only the columns and rows the analysis actually needs)
- correct datatyping (e.g. categoricals rather than strings, smaller numeric types)
- dataframe lifecycles (release large intermediate dataframes once they are no longer needed)
A rough sketch of these principles follows the list.
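As a starting point, something like the following pandas sketch could illustrate all three principles together. The file path, column names, and dtypes here are purely hypothetical placeholders, not part of any existing study definition.

```python
import pandas as pd

# Hypothetical extract file and columns, for illustration only.
INPUT_CSV = "output/input.csv"

# Input minimisation + correct datatyping: read only the columns the
# analysis needs, and assign memory-efficient dtypes up front rather
# than letting pandas default to 64-bit numbers and object strings.
dtypes = {
    "patient_id": "int32",
    "region": "category",
    "age": "Int8",          # nullable integer, tolerates missing values
    "systolic_bp": "float32",
}

df = pd.read_csv(INPUT_CSV, usecols=list(dtypes), dtype=dtypes)

# Dataframe lifecycle: derive what is needed, then drop the large
# intermediate so its memory can be reclaimed before the next step.
summary = df.groupby("region", observed=True)["systolic_bp"].mean()
del df

summary.to_csv("output/systolic_bp_by_region.csv")
```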
We could perhaps also provide a tool for crude estimation of dataset size, based on the number of variables, observations, and their datatypes.
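A minimal sketch of what such an estimator might look like is below; the per-value byte counts are rough rules of thumb rather than measured figures, and the example column set is invented for illustration.

```python
# Approximate in-memory bytes per value for common pandas dtypes.
BYTES_PER_VALUE = {
    "int64": 8,
    "float64": 8,
    "int32": 4,
    "float32": 4,
    "category": 1,     # roughly 1 byte per row plus a small shared lookup table
    "object_str": 60,  # Python string objects cost far more than the text itself
}

def estimate_mb(n_rows: int, columns: dict[str, str]) -> float:
    """Crudely estimate the in-memory size (MB) of a dataframe with the
    given column-name -> dtype mapping."""
    total_bytes = sum(n_rows * BYTES_PER_VALUE[dtype] for dtype in columns.values())
    return total_bytes / 1e6

# Example: 5 million patients extracted monthly over 3 years.
cols = {"patient_id": "int64", "region": "category",
        "age": "int32", "systolic_bp": "float64"}
print(f"Estimated size: {estimate_mb(5_000_000 * 36, cols):,.0f} MB")
```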