model-res-avm
Improvements: low-hanging fruit
Hi there, I saw the talk at Rconf. It looks like you've done a lot of solid work here, though there are many improvements that could be made. After a quick glance at the repo, I have some low-hanging-fruit suggestions that I think should be seriously considered.
I will start with a bit of a sanity check -- in the README, it says:
> The total amount of RAM needed for overall model fitting is around 6GB
Is 6GB of RAM a typo? In the talk, Dan says this model takes a few days to run, but 6GB is below average (some might argue below half the average) for today's $500 consumer-grade or home laptops. It was unclear why AWS would be necessary, or why it would take a few days with LGBM. I saw the 5 download files, and their combined size was well below 1GB, though the joined data is probably a different story, which makes 6GB of RAM sound about right for modelling (which makes it puzzling). For handling $16B in assets, a $400 to $1,500 GPU plus some extra RAM is virtually nothing in comparison and would serve the public much better. Some clarification on this would be appreciated.
So this leads me to the low-hanging fruit: better code/lib choices.
I see a lot of `dplyr` functions, `magrittr` and `%>%` instead of `|>`, etc. Switching to `tidytable` would be an extremely low-hanging fruit while keeping 99% of the `dplyr` codebase. You can remove all `dplyr::`, `tidyr::`, `tidyselect::`, and `purrr::` prefixes (I might be missing a few more), add `library(tidytable); conflict_prefer_all(winner = "tidytable", quiet = TRUE)`, and the 1% change needed would be the 12 places where you use `group_by`. Here is a demo of the very simple changes:
```r
# dplyr
df |>
  group_by(column_group) |>
  mutate(column_new = column_old + 1) |>
  ungroup()

# tidytable
df |>
  mutate(.by = column_group,
         column_new = column_old + 1)
```
And the performance gains would feel like switching from XGBoost to LGBM, while leaving 99% of the code untouched.
Plus, though a very minor gain, `|>` is faster than `%>%`. This also removes the dependency on the `magrittr` library.
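For anyone curious, here is a minimal sketch of how one might check these performance claims locally with the `bench` package; the data frame and column names are illustrative, not from the repo.

```r
# Illustrative microbenchmark: grouped mutate in dplyr vs. tidytable.
# The data and column names here are made up for demonstration.
library(bench)

df <- data.frame(g = sample(letters, 1e6, replace = TRUE), x = runif(1e6))

bench::mark(
  dplyr = df |>
    dplyr::group_by(g) |>
    dplyr::mutate(y = x + 1) |>
    dplyr::ungroup(),
  tidytable = df |>
    tidytable::mutate(y = x + 1, .by = g),
  check = FALSE  # return classes differ (grouped tibble vs. tidytable)
)
```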
And lastly, base R is slower than C++-backed libraries like `tidytable`. Though this is not as low-hanging a fruit, converting base R code to `tidytable` makes the code much easier to read, more collaborative, and reduces human communication time.
And as a minor nitpick bonus: some of your code comments are 2x to 5x longer than the code itself. This shouldn't be necessary, especially for tidy-piped syntax, when the code is already so easy to read.
Hi @PathosEthosLogos! Thanks for the input. We really do appreciate anyone who takes the time to give us feedback. To answer your questions/suggestions:
> Is 6GB of RAM a typo? In the talk, Dan says this model takes a few days to run, but 6GB is below average (some might argue below half the average) for today's $500 consumer-grade or home laptops. It was unclear why AWS would be necessary, or why it would take a few days with LGBM. I saw the 5 download files, and their combined size was well below 1GB, though the joined data is probably a different story, which makes 6GB of RAM sound about right for modelling (which makes it puzzling). For handling $16B in assets, a $400 to $1,500 GPU plus some extra RAM is virtually nothing in comparison and would serve the public much better. Some clarification on this would be appreciated.
To clarify, it takes roughly 30 hours to run a full cross-validation cycle of the model that checks a huge amount of the hyperparameter space. Training a single LightGBM model using the finalized parameters and data takes <10 minutes.
We have some open issues (#36, #38) to reduce the CV time, but they aren't high priority. Boosted tree models tend to have a hyperparameter "plateau", i.e. a stable region of parameters where tuning further doesn't change performance much.
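For a sense of scale, training one model with finalized parameters is a single `lgb.train()` call. Here's a rough sketch with lightgbm's R API; the data and parameter values below are placeholders, not the model's actual finalized values.

```r
library(lightgbm)

# Placeholder data and parameters, not the pipeline's actual inputs.
dtrain <- lgb.Dataset(
  data  = as.matrix(mtcars[, -1]),
  label = mtcars$mpg
)

params <- list(
  objective     = "regression",
  learning_rate = 0.05,  # illustrative values only
  num_leaves    = 63
)

model <- lgb.train(params = params, data = dtrain, nrounds = 500)
```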
RE: Hardware and AWS. I did a whole deep dive benchmarking different models, hardware, and costs. The tl;dr is that it's cheapest and most efficient to simply provision AWS Batch jobs (with any hardware specs we want) to run our models. This lets us run many jobs in parallel and spin the jobs down when we aren't modeling (which is most of the year).
> So this leads me to the low-hanging fruit: better code/lib choices.
I'm a big `data.table` booster myself (I recently tried a native `data.table` rewrite of some stages of the pipeline), but the performance improvements (while substantial) aren't really worth the cost. The vast majority of the pipeline's runtime comes from the model itself, or from calculating feature importance, SHAP values, etc. Given the choice between refactoring for a marginal speedup of something that already works and trying new modeling methods/features/diagnostics, we're opting for the latter.
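As a rough illustration: feature importance (and similarly SHAP values) is a post-fit computation over the trained booster. Assuming the placeholder `model` from the sketch above, it's a one-liner with lightgbm's R API:

```r
# Gain/cover/frequency importance per feature, computed from the fitted
# trees of the placeholder `model` trained in the earlier sketch. At
# real data sizes, passes like this dominate the pipeline's runtime.
library(lightgbm)
lgb.importance(model)
```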
> And as a minor nitpick bonus: some of your code comments are 2x to 5x longer than the code itself. This shouldn't be necessary, especially for tidy-piped syntax, when the code is already so easy to read.
This is a totally fair point 😄. At some point we need to go through and clean up the old comments.
It seems there is a bit of a misunderstanding; I will try to clarify. (The AWS + consumer-grade GPU part isn't really low-hanging fruit anymore, so I will skip it, as well as LGBM, since it requires a bit more work for saving model objects for production.)
I want to emphasise the low-hanging fruit of `tidytable` (definitely not `data.table`(!), which is the backend of `tidytable`). The code changes will be minimal (link provided) and will work identically, except faster, so I think there is a wee bit of misunderstanding here about what you mentioned as costs. With `conflicted` overrides of `dplyr` functions (among a few others) in `tidymodels`, it should also speed up the modelling time at least a bit.
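To make the proposed swap concrete, here's a minimal sketch of the setup I mean, assuming the `conflicted` and `tidytable` packages are installed; `df` and its columns are illustrative.

```r
library(conflicted)
library(tidytable)

# Resolve every masking conflict in tidytable's favor, so existing
# dplyr-style verbs dispatch to tidytable's data.table-backed versions.
conflict_prefer_all(winner = "tidytable", quiet = TRUE)

# Illustrative data; existing piped code runs unchanged.
df <- data.frame(column_group = c("a", "a", "b"), column_old = 1:3)

df |>
  mutate(.by = column_group,
         column_new = column_old + 1)
```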
Ah, got it. Sure, I'm happy to test out `tidytable` in the next few weeks. I'd also be delighted to review a PR with your proposed changes 👀
@Damonamajor or @wagnerlmichael, one of you want to take a crack at this, this upcoming week?
Looks like using `tidytable` would require some significant refactoring to handle the complicated `ifelse()` logic we have in the `assess` stage. Right now, the speed benefit isn't really worth it, so I'm punting this issue to later this year.
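For context, here's a hypothetical sketch of the kind of nested `ifelse()` chain that makes the translation non-trivial, alongside a `case_when()` form that `tidytable` also supports; none of this is the actual `assess`-stage code.

```r
library(tidytable)

# Made-up data, purely for illustration.
df <- tidytable(value = c(-5, 0, 12, 150))

# Nested base-R ifelse() chains like this are easy to mistranslate...
df |>
  mutate(bucket = ifelse(value < 0, "negative",
                  ifelse(value <= 10, "small",
                  ifelse(value <= 100, "medium", "large"))))

# ...while a case_when() rewrite is flatter but changes the code shape.
df |>
  mutate(bucket = case_when(
    value < 0    ~ "negative",
    value <= 10  ~ "small",
    value <= 100 ~ "medium",
    TRUE         ~ "large"
  ))
```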