mars icon indicating copy to clipboard operation
mars copied to clipboard

Optimize SubtaskGraph generation

Open zhongchun opened this issue 2 years ago • 0 comments

What do these changes do?

In gen_subtask_graph, Mars always create new out chunks even if the out chunk already exists. It costs a lot of time if there are plenty of chunks.

Related issue number

Fixes #3341

I did a comparison, in which one creates new out chunks and the other does not. The test scripts are:

import mars.tensor as mt
import mars.dataframe as md

size = 50000
da1 = mt.random.random((size, 2), chunk_size=(1, 2))
df1 = md.DataFrame(da1, columns=list("AB"))
df2 = df1 + 10
df3 = df2.sum()
ret = df3.execute()

Cost time of Subtask generation are: 122.92s, 56.63s.

Check code requirements

  • [ ] tests added / passed (if needed)
  • [ ] Ensure all linting tests pass, see here for how to run them

zhongchun avatar Apr 21 '23 09:04 zhongchun