dask-examples
Added array optimization fuse notebook
From https://github.com/dask/dask/issues/5105.
https://mybinder.org/v2/gh/TomAugspurger/dask-examples/array-fuse (building an image now)
Thanks @alimanfoo, I've applied your suggestions.
@mrocklin do you have high-level thoughts on this? Does this feel like we're just documenting a workaround to a weakness of Dask that we should instead be fixing?
Yes, to me this notebook seems perhaps overly-specific to a single use case. I'm having trouble finding ways to generalize this notebook to other situations. I think that a general example of optimization would be useful. There are plenty of cases where this comes up, such as in ML workloads where you really want X and y to be co-allocated. That case might also be a bit simpler.
Although, in general, I think we can improve many of these cases just by expanding Blockwise and HighLevelGraph operator fusion out to data-access operations.
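For readers unfamiliar with the fusion being discussed: a minimal sketch of low-level task fusion using `dask.optimization.fuse` on a hypothetical toy graph (the key names `load`, `inc`, and `double` are illustrative, not from the notebook):

```python
from dask.optimization import fuse
from dask.core import get  # simple synchronous executor for raw task graphs

# A toy task graph: a linear chain load -> inc -> double
dsk = {
    "load": (lambda: 1,),
    "inc": (lambda x: x + 1, "load"),
    "double": (lambda x: 2 * x, "inc"),
}

# fuse() collapses linear chains into single composite tasks
fused, dependencies = fuse(dsk)

print(len(fused) < len(dsk))  # the graph got smaller
print(get(fused, "double"))   # still computes 2 * (1 + 1) == 4
```

The idea in the comment above is that this kind of fusion could reach further back, all the way into the tasks that read data, rather than stopping at blockwise array operations.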
@TomAugspurger , did you have plans to try to make the story here more general?
Not at the moment.
@mrocklin question on the HLG fusion: would you expect adding additional operations to the end of a task graph (e.g. `.store`) to potentially result in more fusion earlier on? My guess is that extra tasks won't lead to more fusion earlier on, but I may be misreading `fuse`.
I ask because when I look at just the creation / stacking / rechunking, we don't get fusion with the default parameters:
```python
import dask.array as da

inputs = [da.random.random(size=500_000, chunks=90_000)
          for _ in range(5)]
inputs_stacked = da.vstack(inputs)
inputs_rechunked = inputs_stacked.rechunk((50, 90_000))
inputs_rechunked.visualize(optimize_graph=True)
```
So unless adding a `.store()` to the end results in more fusion earlier on (in the creation / stacking / rechunking phase), we won't be solving this use case.
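One way to probe this directly (a sketch, not a fix: it pulls out the low-level graph and calls `fuse` by hand; the `ave_width=5` value is illustrative, chosen to be wider than the default of 1):

```python
import dask
import dask.array as da
from dask.optimization import fuse

inputs = [da.random.random(size=500_000, chunks=90_000)
          for _ in range(5)]
stacked = da.vstack(inputs)
rechunked = stacked.rechunk((50, 90_000))

# Materialize the low-level graph and flatten the output keys
dsk = dict(rechunked.__dask_graph__())
keys = list(dask.core.flatten(rechunked.__dask_keys__()))

# Compare default fusion against a wider ave_width setting
fused_default, _ = fuse(dsk, keys=keys)
fused_wide, _ = fuse(dsk, keys=keys, ave_width=5)
print(len(dsk), len(fused_default), len(fused_wide))
```

Comparing the three graph sizes shows whether the default parameters, rather than the graph's shape, are what is blocking fusion here.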