
Feature request: Enable Spark Support

Open ProbStub opened this issue 2 years ago • 7 comments

Is your feature request related to a problem? I would like to perform optimizations over large portfolios and large numbers of assets, e.g. more than 1M portfolios, using Apache Spark rather than Pandas/NumPy.

Describe the feature you'd like Large asset covariance computations should leverage parallel structures, and portfolio optimization should scale to a larger number of portfolios, similar to the mocked-up implementation on this fork.

Additional context Hi,

I am currently thinking about using pypfopt in a Spark environment. I have mocked up my idea on a fork here - it's a work in progress. The API and other changes are purely for demonstration purposes and would clearly need much more discussion if a PR were to be considered; also, Spark 3.2's pyspark.pandas will simplify the effort.
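
For concreteness, here is a minimal sketch (not taken from the fork) of how the covariance step could be distributed with Spark MLlib's RowMatrix; the parquet path and the returns schema are assumptions for illustration:

```python
# Minimal sketch: distributed sample covariance via Spark MLlib's
# RowMatrix. Assumes a returns table with one column per asset and
# one row per date; the file path is hypothetical.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("cov-sketch").getOrCreate()

returns_df = spark.read.parquet("daily_returns.parquet")  # hypothetical path
tickers = returns_df.columns  # keep the column order; RowMatrix drops labels

rows = returns_df.rdd.map(lambda row: [float(x) for x in row])
cov = RowMatrix(rows).computeCovariance().toArray()  # local NumPy array

# cov * 252 could then be wrapped in a labelled DataFrame and handed to
# pypfopt just like the output of risk_models.sample_cov.
```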

Before going ahead further I would have two questions:

  1. Is spark support generally desired for pypfopt?
  2. If yes, should that be an extension of the API as in the mock, or should it be a more stand-alone kind of implementation?

Cheers and happily looking forward to your thoughts.

Best regards, Prob

ProbStub avatar Oct 14 '21 16:10 ProbStub

@ProbStub Very cool feature request!

You mention scaling across many portfolios and across many assets, which to me are two quite different problems. The former can easily be parallelized in most cases anyway. For the latter, I'm really curious to hear what use cases you have in mind that require e.g. Spark to compute.
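
To illustrate the former: each portfolio's optimization is independent, so a plain RDD fan-out is enough. A minimal sketch with synthetic inputs (pypfopt would need to be installed on the workers; the input data here is fabricated for illustration):

```python
# Minimal sketch: parallelizing many independent optimizations with a
# plain RDD map. The synthetic (mu, S) inputs stand in for real data.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pypfopt import EfficientFrontier

spark = SparkSession.builder.appName("many-portfolios").getOrCreate()

tickers = ["A", "B", "C"]
rng = np.random.default_rng(0)

def fake_inputs():
    rets = pd.DataFrame(rng.normal(0.001, 0.02, (250, 3)), columns=tickers)
    return rets.mean() * 252, rets.cov() * 252  # annualised mu, S

def optimize_one(inputs):
    mu, S = inputs
    ef = EfficientFrontier(mu, S)
    ef.min_volatility()
    return dict(ef.clean_weights())

portfolio_inputs = [fake_inputs() for _ in range(100)]
weights = spark.sparkContext.parallelize(portfolio_inputs).map(optimize_one).collect()
```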

phschiele avatar Oct 14 '21 17:10 phschiele

@phschiele Yes, two very different use cases. A large number of portfolios may result from the number of users and would not, on its own, be a challenge. However, the system would perform look-through ETF decomposition and re-compose the portfolios given a selection of metrics. An ETF portfolio may contain several positions across different share classes, and even after aggregating up to the same securities, the number of positions may be quite large. Hope that adds a little color.

ProbStub avatar Oct 14 '21 19:10 ProbStub

Gotta be honest, this is a bit beyond me! I've never had to deal with portfolios of more than a few hundred assets.

robertmartin8 avatar Oct 19 '21 19:10 robertmartin8

In my experience, the optimization itself usually becomes the bottleneck first.

With that said, it's cool that the Koalas project will be included directly in PySpark; thanks for sharing that!

phschiele avatar Oct 19 '21 19:10 phschiele

@robertmartin8 It is admittedly more of an issue with fixed-income ETFs, but some large equity ETFs (e.g. VT) have a huge number of holdings. Adding alternatives and hard-to-value assets, the numbers increase further.

If I were to extend pypfopt with PySpark capabilities, would you prefer that to be:

  a) totally separate from the existing methods and classes (which would allow much better use of SparkML pipelines)?
  b) an extension that becomes an option of the existing API, much like in the mock-up fork (allowing users to adopt an alternative compute backend more easily)?
  c) placed in a different package, so that these features do not affect pypfopt design going forward (which would allow more degrees of freedom, given Spark has its own wrapper for BLAS)?
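
For illustration, option b) could look roughly like the following. The `backend` parameter is hypothetical, not pypfopt's actual API, and the Spark branch assumes a release where pandas-on-Spark provides `cov()` (it does not in 3.2, see below):

```python
# Hypothetical sketch of option b): a backend switch on an existing-style
# function. Not pypfopt's real sample_cov; the Spark branch assumes a
# release where pyspark.pandas.DataFrame.cov exists.
import pandas as pd

def sample_cov(prices, backend="pandas", frequency=252):
    """Annualised sample covariance of asset returns from a price frame."""
    if backend == "spark":
        import pyspark.pandas as ps
        returns = ps.from_pandas(prices).pct_change().dropna()
        return returns.cov().to_pandas() * frequency
    returns = prices.pct_change().dropna()
    return returns.cov() * frequency
```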

I am more than happy to work on this quietly and propose a PR in a few weeks' time. It would help to know how you would prefer to do such an extension, if at all.

@phschiele: The optimization bottleneck is certainly a challenge, though that problem only arises when and if the computations become possible at all. Additionally, one may want to reduce the target holdings to far fewer than the original 1000+ positions, either as part of the optimization or via other techniques.

ProbStub avatar Oct 20 '21 12:10 ProbStub

@ProbStub did you go any further with this?

blair-anson avatar Jun 24 '22 07:06 blair-anson

@blair-anson I had to stop working on this due to other commitments and the limited interest, but I'm happy to pick it up again. There have been a few Spark releases since, so chances are this might be easier now. Back then, the pandas cov function was still missing from the Pandas API on Spark.
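
For reference, newer releases appear to close that gap; a quick check, assuming a Spark version (3.3 or later, if I recall correctly) where the pandas API on Spark includes DataFrame.cov:

```python
import pyspark.pandas as ps

# Assumes a Spark release where pandas-on-Spark provides cov()
# (it was still missing around Spark 3.2, as noted above).
psdf = ps.DataFrame({"AAPL": [0.010, -0.020, 0.015], "MSFT": [0.005, 0.010, -0.010]})
print(psdf.cov())
```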

ProbStub avatar Jun 24 '22 07:06 ProbStub