LightGBM
[Question] LightGBM performance scaling with RAM speeds
Hi there, this is more of a question than an issue really. Please let me know if this is not the appropriate place for it; I've found no better spot, but if there is one I'll gladly take it there.
Short version: does anyone know whether the training time of CPU implementations of tabular learning algorithms (LightGBM in particular, but also XGBoost, TabNet, etc.) depends on RAM speed?
Longer version: I recently switched from an i7 12700KF CPU to an i9 13900K. For a somewhat heavy AutoGluon training run (most of the time is spent in the algorithms above) that used to take 4 hours, I got a 1.6x speedup from the newer processor, which is great: training now takes 2.5 hours, so more trials per day of work. My RAM is a 2x32GB kit of DDR4 memory that can run overclocked at 3200MHz (CL18). However, while installing the new CPU it defaulted back to 2133MHz. At that speed training was far slower; I don't recall the exact figure, but it was something like 50% as fast. After overclocking back to 3200MHz, I got the 1.6x speedup.
There are thousands of RAM benchmarks for games (where RAM speed has a limited impact), but I've found none for ML. The closest I got was this LTT video https://www.youtube.com/watch?v=b-WFetQjifc which shows a major impact for some productivity apps, but none of those are ML applications.
So my question is: are these algorithms' training times sensitive to RAM bandwidth? And more so for CPUs with higher core counts?
Yeah, training is sensitive to RAM speed. The core algorithm in LightGBM (histogram accumulation) is quite simple and consists mostly of additions; the bottleneck is indeed memory speed.
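To illustrate what that means, here is a rough sketch of the access pattern (not LightGBM's actual code, just an illustration): for each feature, per-row gradients are added into per-bin buckets, so a training iteration mostly streams the binned feature matrix and gradient array through memory rather than doing heavy arithmetic.

```python
import numpy as np

def build_histograms(binned_features, gradients, n_bins=256):
    """Illustrative histogram accumulation.

    binned_features: (n_rows, n_features) uint8 array of bin indices
    gradients: (n_rows,) float array of per-row gradients
    Returns a (n_features, n_bins) array of gradient sums per bin.
    """
    n_rows, n_features = binned_features.shape
    hist = np.zeros((n_features, n_bins), dtype=np.float64)
    for j in range(n_features):
        # Equivalent to: for each row i, hist[j, binned_features[i, j]] += gradients[i].
        # One streaming pass over a feature column plus the gradient array:
        # almost no math, mostly memory traffic.
        hist[j] = np.bincount(binned_features[:, j],
                              weights=gradients,
                              minlength=n_bins)
    return hist

rng = np.random.default_rng(0)
X_binned = rng.integers(0, 256, size=(1_000_000, 50), dtype=np.uint8)
grad = rng.standard_normal(1_000_000)
hist = build_histograms(X_binned, grad)
```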
This is a very interesting discussion! I also have a hard time finding benchmarks for ML. My question is: is there a big difference when jumping from DDR4 to the new DDR5 RAM?
Hey, thank you so much for answering and for those insights. I've picked up a DDR5@6600MHz kit which should be arriving in a few days. I'll try to put together a simple benchmark for the algorithms and test them with my current DDR4 without XMP (@2133MHz) and with XMP (@3200MHz), and then with the DDR5 without XMP (@4800MHz, I think) and with XMP (@6600MHz). That should give us a few data points to understand the scaling. I'll let you know when I have those ready!
Thank you so much!
Hey guys, I still don't have the RAM with me yet, but I've put together this repo for the benchmark: https://github.com/Ludecan/ml_benchmark It's very much a WIP and definitely has a lot to improve, but it should work as a basis. Comments, issues and PRs welcome!
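The basic idea is a timing loop roughly like the sketch below (a simplified, hypothetical version for illustration; the actual repo covers more models and dataset shapes): train on synthetic data of increasing size and record the wall-clock fit time, once per RAM configuration.

```python
import time
import numpy as np
import lightgbm as lgb

def time_lgbm_fit(n_rows, n_cols, seed=0):
    """Train a LightGBM regressor on synthetic data and return the fit time in seconds."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_rows, n_cols)).astype(np.float32)
    y = X[:, 0] + 0.1 * rng.standard_normal(n_rows).astype(np.float32)
    model = lgb.LGBMRegressor(n_estimators=200)
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

# Sweep problem sizes; run this once per RAM configuration and compare.
for n_rows in (100_000, 500_000, 1_000_000):
    for n_cols in (10, 50, 100):
        print(f"{n_rows} x {n_cols}: {time_lgbm_fit(n_rows, n_cols):.2f}s")
```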
ping... do you have any updates? Where can we see the results of your benchmark?
Hey @celestinoxp, yup, I have some raw numbers. I also have the DDR5 Kit and motherboard with me but haven't gotten to installing them yet, likely this weekend or the next.
What I did get, though, are the results for running DDR4@2133MHz and @3200MHz. Even with this small change there is a difference, and it's much more significant for larger dataset sizes. Starting at 1M rows you can see speedups between 10% and almost 30% for most models just from this jump. For up to 100K rows the differences are much smaller, if there are any at all. In my particular application (a Market Simulation you can read about here if you're interested; it's a lot of very small models, but trained many, many times with bootstrapping) it made a 5% difference overall.
I haven't had time to properly format them, but you can find the raw results so far here.
PS: I'm having some problems with the TabNet and TFDF models. Their results are way off.
Hi @Ludecan, some ideas:
- the results sheet does not show which CPU it is; it's probably important to know the number of cores and the frequency in MHz.
- maybe you can check the difference on a dataset with more columns; I think 1000 x 1000 would be enough to push the processor and the memory to their limits...
I'm very excited to see the DDR5 results...
Thank you for your work and your time
Hey @celestinoxp, yeah, I intend to share the full system info in the final results. The CPU is an i9 13900K with 24 cores and 32 threads, so rather thread heavy. That might affect results too; I'm not sure how closely core count and RAM bandwidth requirements go hand in hand, but in principle they should grow together. If you have a different CPU you can try the benchmark yourself; it should be relatively simple to install. I'll add the 1000-column case into the mix and see how it goes. I wanted to go up to 10M rows, but on my 64GB machine one of AutoGluon's models ran out of memory, so I think I'll disable AutoGluon and TabNet (the heaviest RAM users) for 10M rows and leave them enabled for the rest. Also, the fastest measurements are all over the place; I should run those several times and take the median. Work work...
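For those noisy short runs, something like the sketch below is what I have in mind (`run_once` is a hypothetical name standing for any zero-argument function that trains one model): repeat the measurement and report the median so outliers from background processes don't dominate.

```python
import statistics
import time

def median_time(run_once, n_repeats=5):
    """Run a zero-argument benchmark function several times and return the median wall-clock time."""
    timings = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Toy usage; in the benchmark, run_once would train one of the models.
print(median_time(lambda: sum(i * i for i in range(10**6))))
```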
@Ludecan excellent work! Congratulations! :)
Hey @celestinoxp, I've installed the DDR5 RAM and made the changes from the previous post to the code. It's currently running the DDR5@6600MHz test, but I'm already seeing ~10% more performance for the 100K-row dataset compared to DDR4@3200MHz, and bigger differences for bigger datasets. So smaller datasets are now showing gains too, and bigger datasets are showing even bigger improvements. I'll have the final numbers ready in a couple of days.
Hey there @celestinoxp, @guolinke, I have the results with me.
This is a HWiNFO screenshot with the system info (there are more PNGs for the different RAM speeds; the 3200MHz one is identical to the 2133MHz one, but I forgot to save the pic).
These are the Speedups vs DDR4@2133MHz for LightGBM
Larger problems see some massive gains. The 1000000 x 100 one saw a 219% speedup, more than twice as fast. For reference, moving from an i7 12700K to an i9 13900K gave me a 50% speedup in a real-world AutoGluon problem, so 219% is quite massive.
For smaller problem sizes there's no real difference, though. I guess RAM speed starts to matter once we saturate the CPU caches.
AutoGluon in particular, which was the longest running of all the algorithms, sees big gains as well, starting from smaller but still relatively big problem sizes (500000 x 10).
The trend is more or less similar across all algorithms except XGBoost (when not using the histogram method), which sees no gains at all across RAM speeds.
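For anyone reproducing this, switching XGBoost between the two configurations is just a parameter; a minimal sketch using the sklearn-style API, assuming X and y arrays like in the earlier sketch:

```python
import xgboost as xgb

# "exact" enumerates raw feature values and is the configuration that saw no
# gains from faster RAM here; "hist" bins features into histograms, similar in
# spirit to LightGBM's approach.
exact_model = xgb.XGBRegressor(tree_method="exact", n_estimators=200)
hist_model = xgb.XGBRegressor(tree_method="hist", n_estimators=200)
# exact_model.fit(X, y); hist_model.fit(X, y)  # X, y as in the earlier sketch
```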
You can find the results for all models in this spreadsheet. Change the model filter in the Speedups sheet to get the charts for the different models. The linear regression timings are not very accurate because they take too little time to execute, so don't pay much attention to them.
I plan to add a discussion section around the results to the repo itself here.
And that's that. I guess the answer to the question of whether investing in faster RAM is worth it depends on how big your problem is. For more than 1M rows I'd say it's a no-brainer; for 500K it may be worth it; for less than that, it isn't. Also, these results are for the 13900K with 32 threads. I don't know how well they would carry over to other CPUs, but anyone is welcome to test with the benchmark.
@Ludecan excellent work, congratulations! One more question, since we now have a point of comparison: I'd like to see something more specific to my case, which is the problem of having many columns. I use machine learning for regression and classification with pycaret (it requires less code), but I need to use polynomial_features, which greatly increases the number of columns in a dataset. Polynomial features can in some cases be the solution to a problem, but the number of columns can blow up very quickly.
Could you consider doing a test with many columns, e.g. 1000, 2000, etc.? Even with fewer rows... 1000 x 1000, for example.
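For context on how quickly the column count grows, here is a quick sketch using sklearn's PolynomialFeatures for illustration (pycaret's internals may differ): with a full degree-d expansion of n features, the output has C(n + d, d) columns including the bias term.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 100, 2
# Number of output columns (including bias) is C(n_features + degree, degree).
print(comb(n_features + degree, degree))  # 5151

X = np.random.rand(10, n_features)
X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
print(X_poly.shape)  # (10, 5151)
```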
Well, I actually tried adding more columns to the set, but I started running out of memory. The smaller problems would run, but the larger ones would take a few hours and then crash, so I had to cut them short or I wouldn't have been able to finish the tests in time. I no longer have the DDR4 RAM with me, so I won't be able to test at the slower RAM speeds.
One thing that also seems to be a common trend is that the speedup accelerates with the number of columns, i.e. going from 50 to 100 columns speeds things up more than going from 10 to 50 in all the large problems.
The 1000-row problem already seems to show some speedup, so combining those two things it's likely you'd see a performance advantage.
Hey guys, well, if there's no further discussion I'm closing the issue. Thanks for your help and feel free to reach me in either the repo or via email if there's anything else to discuss.
I created 3 issues in your repository a long time ago and they are unanswered, lol https://github.com/Ludecan/ml_benchmark/issues
Thanks so much @Ludecan ! We'd love to have you come contribute here if you ever have the time / interest.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.