vectorbt
Use Dask dataframe to address simulations using up all memory
Hi. Appreciate your work on this lib.
One thing I am noticing is that I constantly have to split the workload into smaller pieces in order to use vectorbt in real-world cases. I usually simulate on second-resolution data, and this leads to simulations that run out of memory pretty quickly, even with 128 GB of RAM. Have you considered implementing something like Dask DataFrame support? I can already pass a Dask DataFrame to vbt, but vbt quickly converts the partitioned dataframe into a single object that again ends up being too large.
( https://docs.dask.org/en/latest/dataframe.html )
@adventdtr I'd love to use Dask too, but vectorbt doesn't rely on pandas internally (the heavy lifting happens on NumPy arrays with Numba), nor are vectorbt's calculations highly parallelizable (see my previous answer).
This is because of the time-series nature of financial and trading data: future timestamps depend on previous timestamps, so you can't divide your data by the index. You might be able to divide it by columns though (as you're probably doing right now), but you have to split manually prior to the computation.
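For illustration, here is a minimal sketch of such a manual column split, assuming a wide `close` price DataFrame and matching `entries`/`exits` signal DataFrames; the `run_in_column_chunks` helper and the chunk size are hypothetical and only meant to show the idea:

```python
import pandas as pd
import vectorbt as vbt

def run_in_column_chunks(close, entries, exits, chunk_size=100):
    """Simulate column chunks one at a time to limit peak memory."""
    results = []
    for start in range(0, close.shape[1], chunk_size):
        cols = close.columns[start:start + chunk_size]
        pf = vbt.Portfolio.from_signals(close[cols], entries[cols], exits[cols])
        # Keep only the small aggregated output; the portfolio object
        # (with its large per-column arrays) can then be garbage-collected.
        results.append(pf.total_return())
        del pf
    return pd.concat(results)
```

Only the aggregated per-column results survive each iteration, so peak memory is bounded by one chunk rather than the full simulation.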
To combat memory issues, vectorbt does lazy broadcasting in most places: passing two arrays of shapes [1, 1000] and [1000, 1] won't broadcast and materialize into two arrays of shape [1000, 1000]; they're kept as-is, and vectorbt knows exactly how to put the puzzle together dynamically. But as soon as you broadcast by column labels, such as [1, 1000] and [1000, 20], the broadcasted arrays must materialize. The only arrays that always materialize are close and call_seq, since both are later used for assessing performance, so you can't save much there.

Also, when you access performance metrics such as portfolio.returns, vectorbt caches all intermediate results by default, so make sure to disable caching in the settings.

If you still run out of memory, consider porting your strategy to Portfolio.from_order_func and implementing your own memory-friendly simulation. For example, instead of running the simulation and then calling portfolio.returns() - which must traverse the data several times to calculate cash, holdings, portfolio value, and finally returns - you can pre-calculate them all at each timestamp and write them to arrays in your order function, without the need to construct a portfolio object.
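As a rough sketch of the caching suggestion above, assuming the settings layout of recent vectorbt releases (the exact keys may differ in your installed version, so verify against `vbt.settings`):

```python
import vectorbt as vbt

# Disable global caching so intermediate results (cash, holdings, value, ...)
# are recomputed on demand instead of being held in memory after each
# metric call. The settings layout may differ across vectorbt versions.
vbt.settings.caching['enabled'] = False
```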
Generally, vectorbt already implements plenty of tricks and compresses data as much as possible, so even the most demanding use cases can usually be rewritten in a way that makes them feasible without parallelization or data splitting.
I see, that makes sense. You're right in saying that I split up the simulations by columns first, as that is the easiest way to reduce the memory footprint.