dask-examples

Added array optimization fuse notebook

Open TomAugspurger opened this issue 5 years ago • 6 comments

From https://github.com/dask/dask/issues/5105.

https://mybinder.org/v2/gh/TomAugspurger/dask-examples/array-fuse (building an image now)

TomAugspurger avatar Jul 18 '19 14:07 TomAugspurger

Thanks @alimanfoo, I've applied your suggestions.

@mrocklin do you have high-level thoughts on this? Does this feel like we're just documenting a workaround to a weakness of Dask that we should instead be fixing?

TomAugspurger avatar Jul 19 '19 20:07 TomAugspurger

Yes, to me this notebook seems perhaps overly-specific to a single use case. I'm having trouble finding ways to generalize this notebook to other situations. I think that a general example of optimization would be useful. There are plenty of cases where this comes up, such as in ML workloads where you really want X and y to be co-allocated. That case might also be a bit simpler.


mrocklin avatar Jul 19 '19 22:07 mrocklin

That said, I think we can improve many of these cases just by extending Blockwise and HighLevelGraph operator fusion to cover data access operations.
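For readers unfamiliar with Blockwise fusion, here is a minimal sketch of what it does for elementwise operations today. The array `x` and the `(x + 1) * 2` expression are illustrative assumptions, not taken from the notebook under review, and the exact task counts depend on the installed Dask version.

```python
import dask
import dask.array as da

# Illustrative input (an assumption, not from the notebook): two chunks of ones.
x = da.ones(10, chunks=5)

# Two chained elementwise operations; each is a Blockwise layer.
y = (x + 1) * 2

# Count tasks in the raw graph, then in the optimized graph.
n_before = len(dict(y.__dask_graph__()))
(y_opt,) = dask.optimize(y)
n_after = len(dict(y_opt.__dask_graph__()))

# Fewer tasks after optimization indicates the elementwise steps were fused.
print("tasks before:", n_before, "tasks after:", n_after)
```

The proposal in the comment above is to let this kind of fusion also reach the operations that load or create the data, so that the whole per-chunk pipeline collapses into one task.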


mrocklin avatar Jul 19 '19 22:07 mrocklin

@TomAugspurger, did you have plans to try to make the story here more general?

martindurant avatar Jul 31 '19 19:07 martindurant

Not at the moment.


TomAugspurger avatar Jul 31 '19 19:07 TomAugspurger

@mrocklin question on the HLG fusion: would you expect adding additional operations to the end of a task graph (e.g. .store) to potentially result in more fusion earlier on? My guess is that extra tasks won't lead to more fusion earlier on, but I may be misreading fuse.

I ask because when I look at just the creation / stacking / rechunking, we don't get fusion with the default parameters:

import dask.array as da

inputs = [da.random.random(size=500_000, chunks=90_000)
          for _ in range(5)]
inputs_stacked = da.vstack(inputs)
inputs_rechunked = inputs_stacked.rechunk((50, 90_000))
inputs_rechunked.visualize(optimize_graph=True)

(Image: the optimized task graph rendered by the visualize call above.)

So unless adding a .store() to the end results in more fusion earlier on (in the creation / stacking / rechunking phase), we won't be solving this use-case.
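The same check can be made without rendering the graph, by counting tasks before and after optimization. This is only a sketch: it reuses the arrays from the snippet above, and the `optimization.fuse.ave-width` value of 5 is an arbitrary assumption chosen for comparison. Whether that setting changes the count depends on the Dask version, since the low-level fusion pass has changed over time.

```python
import dask
import dask.array as da

# Same construction as in the comment above.
inputs = [da.random.random(size=500_000, chunks=90_000) for _ in range(5)]
inputs_stacked = da.vstack(inputs)
inputs_rechunked = inputs_stacked.rechunk((50, 90_000))


def n_tasks(collection):
    """Number of tasks in the collection's materialized graph."""
    return len(dict(collection.__dask_graph__()))


n_raw = n_tasks(inputs_rechunked)

# Optimize with the default fusion parameters.
(opt_default,) = dask.optimize(inputs_rechunked)
n_default = n_tasks(opt_default)

# Optimize again with a wider fusion setting (assumed value, for comparison only).
with dask.config.set({"optimization.fuse.ave-width": 5}):
    (opt_wide,) = dask.optimize(inputs_rechunked)
n_wide = n_tasks(opt_wide)

print("raw:", n_raw, "default:", n_default, "ave-width=5:", n_wide)
```

If the default count stays close to the raw count, the creation, stacking, and rechunking tasks were not fused, which matches the observation above.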

TomAugspurger avatar Aug 01 '19 15:08 TomAugspurger