Redefining model parameter `compute`
Right now, setting `compute=True` triggers a sequential computation of the following steps (rough sketch after the list):
- the preprocessor scaler
- optional NaN checks
- SVD
- scores and components
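To make the contrast concrete, here's a minimal toy sketch (plain dask arrays, not the actual xeofs internals) of the eager, step-by-step execution versus a single lazy graph:

```python
import dask
import dask.array as da
import numpy as np

# Toy stand-in for a (samples x features) data matrix.
X = da.random.random((10_000, 2_000), chunks=(1_000, 2_000))

# Eager pipeline, roughly what compute=True implies: each intermediate
# is materialized before the next step starts.
X_scaled = ((X - X.mean(axis=0)) / X.std(axis=0)).compute()  # scaler
assert not np.isnan(X_scaled).any()                          # NaN check
u, s, vt = np.linalg.svd(X_scaled, full_matrices=False)      # SVD
scores, components = u * s, vt                               # scores/components

# Fully lazy alternative: one graph, Dask decides the execution order.
X_lazy = (X - X.mean(axis=0)) / X.std(axis=0)
u, s, vt = da.linalg.svd_compressed(X_lazy, k=10)
scores, components = dask.compute(u * s, vt)
```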
I’m starting to doubt the usefulness of the `compute` parameter in its current form. When @slevang managed to implement fully lazy evaluation, I thought it made sense to keep this parameter, but now I'm not so sure. My experience has been that it’s almost always faster to let Dask handle everything and work out the most efficient way to compute on its own, rather than forcing intermediate steps with `compute=True`. I plan to run some benchmarks to confirm this when I get a chance.
The only real benefit I see for this parameter might be in cross-decomposition models (like CCA, MCA, etc.), where PCA preprocessing is applied to the individual datasets to make the subsequent steps computationally feasible. Often this results in a reduced PC space that fits into memory, so continuing with eager computation for the SVD, components, scores, etc. makes sense (see the sketch after this paragraph). Another advantage could be when the number of PCs is defined by explained variance rather than a fixed number; this requires computing the individual PCA models upfront.
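For illustration, here's a hedged sketch of that pattern with toy arrays (names and shapes invented, not the xeofs API): reduce each dataset to a small PC space lazily, materialize the reduced scores, then keep everything downstream eagerly in memory.

```python
import dask
import dask.array as da
import numpy as np

# Two large toy fields sharing a sample dimension.
X = da.random.random((50_000, 5_000), chunks=(5_000, 5_000))
Y = da.random.random((50_000, 3_000), chunks=(5_000, 3_000))

# PCA preprocessing on each dataset individually (lazy).
ux, sx, _ = da.linalg.svd_compressed(X - X.mean(axis=0), k=50)
uy, sy, _ = da.linalg.svd_compressed(Y - Y.mean(axis=0), k=50)

# The reduced PC spaces easily fit into memory, so compute them eagerly...
pcs_x, pcs_y = dask.compute(ux * sx, uy * sy)

# ...and the cross step (here an MCA-style SVD of the cross-covariance
# between the two PC spaces) stays a cheap in-memory numpy operation.
cov = pcs_x.T @ pcs_y / (pcs_x.shape[0] - 1)
u, s, vt = np.linalg.svd(cov, full_matrices=False)
```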
So, I’m thinking it might be more practical to redefine the `compute` parameter as something like `compute_preprocessor`. This would compute the following in one go, letting Dask work out the most efficient combined graph:
- the preprocessor scaler
- optional NaN checks
- PCA preprocessing
From there, we can continue with eager computation for the SVD, scores, and components; a rough sketch follows.
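Here's roughly what I have in mind, as a purely hypothetical sketch (all of these names are made up for illustration, nothing in the current code base):

```python
import dask
import dask.array as da
import numpy as np

X = da.random.random((100_000, 2_000), chunks=(10_000, 2_000))

# Build the entire preprocessor lazily...
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)        # scaler
nan_free = ~da.isnan(X_scaled).any()                   # optional NaN check
u, s, _ = da.linalg.svd_compressed(X_scaled, k=100)    # PCA preprocessing

# ...then materialize it in a single pass, so Dask can share
# intermediates between the scaler, the NaN check and the PCA step.
pcs, nan_free = dask.compute(u * s, nan_free)
assert nan_free

# From here on the reduced data lives in memory, so the subsequent
# SVD, scores and components can be computed eagerly with numpy.
u2, s2, vt2 = np.linalg.svd(pcs, full_matrices=False)
```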
I’d love to hear your thoughts on this, @slevang, since you probably have much more experience with Dask’s magic! :)
One example I remember where this does have a major impact is the rotator classes. The design where we fit these iteratively with a tolerance check simply can't be done in a single dask graph, so I had to drop that and just run a fixed number of iterations.
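To illustrate why (using power iteration as a stand-in for the rotation update, since the problem is identical): a data-dependent stopping rule needs a concrete number at every iteration, which forces a compute, while a fixed iteration count just keeps appending to one lazy graph.

```python
import dask.array as da

A = da.random.random((500, 500), chunks=(500, 500))

def iterate_fixed(A, n_iter=20):
    # Fixed number of iterations: the loop bound is a plain Python int,
    # so this builds a single lazy graph and computes nothing.
    v = da.random.random((A.shape[1],), chunks=A.shape[1])
    for _ in range(n_iter):
        w = A @ v
        v = w / da.linalg.norm(w)
    return v  # still lazy

def iterate_tol(A, tol=1e-8, max_iter=100):
    # Tolerance check: `delta < tol` needs a concrete float, so every
    # iteration forces a compute of the graph built so far.
    v = da.random.random((A.shape[1],), chunks=A.shape[1]).persist()
    for _ in range(max_iter):
        w = A @ v
        v_new = (w / da.linalg.norm(w)).persist()
        if float(da.linalg.norm(v_new - v)) < tol:  # compute happens here
            break
        v = v_new
    return v_new
```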
The other piece that couldn't be done lazily was sorting the modes by variance explained, because that would require reindexing by a dask array, which isn't currently allowed. I worked around that, though, by manually redefining the `compute` method on the model classes and bundling the sorting operation in at the end.
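In toy form, the workaround looks something like this (illustrative arrays, not the actual model classes):

```python
import dask
import dask.array as da
import numpy as np

# Lazy stand-ins for the model output.
components = da.random.random((2_000, 10), chunks=(500, 10))
expvar = da.random.random((10,), chunks=10)

# Sorting lazily would mean reindexing by da.argsort(expvar), i.e. by an
# array whose values are unknown until computed, which isn't allowed.

# So: compute first, then sort with plain numpy indexing at the end.
components, expvar = dask.compute(components, expvar)
order = np.argsort(expvar)[::-1]
components, expvar = components[:, order], expvar[order]
```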
It would be interesting to do more benchmarking on the compute time and memory footprint of adding some intermediate compute/persist steps on the preprocessor factors like you mention. Although if we saw major improvements there, we could also try inserting things like `dask.graph_manipulation.wait_on` in strategic locations, which essentially does the same thing as specifying something like `compute_preprocessor` but still lets you build the entire graph lazily.
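A minimal sketch of what that could look like (`wait_on` is real dask API; the surrounding pipeline is invented):

```python
import dask.array as da
from dask.graph_manipulation import wait_on

X = da.random.random((100_000, 2_000), chunks=(10_000, 2_000))

# Preprocessor stage, built lazily.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
nan_free = ~da.isnan(X_scaled).any()

# Checkpoint: every task up to here must finish before any downstream
# task starts, yet the graph as a whole remains lazy.
X_scaled, nan_free = wait_on(X_scaled, nan_free)

u, s, vt = da.linalg.svd_compressed(X_scaled, k=20)
scores = (u * s).compute()
```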
Oh right, I completely forgot about the rotator classes. Thanks for pointing me to `dask.graph_manipulation.wait_on` - I wasn’t aware of that one. Do you think this approach could help build the entire graph for the PCA preprocessor lazily, where the number of PCs to retain (and thus the shape of the output matrices) is determined by the data instead of being a fixed parameter?
To be more specific, I recently adapted (#184) the `Decomposer` to perform a truncated SVD based on a specified amount of explained variance. Right now, I simply prohibit users from combining variance-based truncation with Dask. Do you think it’s possible to adapt the code with `wait_on` so that the model can still be built lazily even when it’s based on variance?
I don't think so. All `wait_on` does is tell dask, when it goes to order and execute the task graph, that all tasks up to that point need to run prior to anything else downstream. You still aren't able to build indexing operations with dask arrays. Indexing with an array of unknown size would be very hard to ever implement in the dask framework, while indexing with an array of known size but unknown values is, I believe, possible in principle but not implemented.
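The unknown-size case is easy to see with boolean indexing, which dask does allow but which leaves the result with unknown chunk sizes:

```python
import dask.array as da

x = da.random.random((1_000,), chunks=100)

# Boolean indexing works, but the result's size depends on values
# that haven't been computed yet...
y = x[x > 0.5]
print(y.shape)   # (nan,)
print(y.chunks)  # ((nan, nan, ..., nan),)

# ...so anything downstream that needs the shape fails until you call
# y.compute_chunk_sizes(), which triggers a computation.
```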
Since this is a truncation, and we're running the larger computation anyway, I wonder if we could make this feature dask-enabled by using the `model._post_compute()` trick I used for the variance sorting; a sketch follows below. That seems like a natural fit, although in some of the more complex cross models maybe this gets messy?
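Something along these lines, as a purely hypothetical sketch (attribute and class names invented for illustration; I haven't checked how this maps onto the cross models):

```python
import numpy as np

class VarianceTruncationMixin:
    """Decompose lazily with a generous fixed k; once compute() has run
    (so the explained variances are concrete numbers), cut the results
    down to the requested amount of explained variance."""

    def _post_compute(self):
        # Everything is in memory at this point.
        cumvar = np.cumsum(self.explained_variance_ratio_)
        n_modes = int(np.searchsorted(cumvar, self.variance_threshold) + 1)
        self.components_ = self.components_[:, :n_modes]
        self.scores_ = self.scores_[:, :n_modes]
        self.explained_variance_ratio_ = self.explained_variance_ratio_[:n_modes]
```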
On larger datasets in practice I've seen very little difference in runtime with `dask.array.linalg.svd_compressed` between computing, say, 10 modes vs 100 modes. So I default to some larger number and then subset down from there as needed once the decomposition has run.
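In rough toy form (array sizes invented):

```python
import dask.array as da

X = da.random.random((200_000, 5_000), chunks=(20_000, 5_000))

# Runtime is dominated by the passes over X, not by k, so request a
# generous number of modes up front...
u, s, vt = da.linalg.svd_compressed(X, k=100)

# ...and subset afterwards, once you know how many modes you need.
u, s, vt = u[:, :10], s[:10], vt[:10, :]
```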
I admit that this is a bit beyond my programming brain right now, and I can't really see how to apply `model._post_compute()` to the cross models (maybe I've just been staring at the screen for too long today 😅). I hope to send a PR today or tomorrow with a slightly cleaner implementation of the `BaseCrossModel` class - maybe that will help too.