mars
mars copied to clipboard
Optimize SubtaskGraph generation
What do these changes do?
In gen_subtask_graph, Mars always create new out chunks even if the out chunk already exists. It costs a lot of time if there are plenty of chunks.
Related issue number
Fixes #3341
I did a comparison, in which one creates new out chunks and the other does not. The test scripts are:
import mars.tensor as mt
import mars.dataframe as md
size = 50000
da1 = mt.random.random((size, 2), chunk_size=(1, 2))
df1 = md.DataFrame(da1, columns=list("AB"))
df2 = df1 + 10
df3 = df2.sum()
ret = df3.execute()
Cost time of Subtask generation are: 122.92s, 56.63s.
Check code requirements
- [ ] tests added / passed (if needed)
- [ ] Ensure all linting tests pass, see here for how to run them