mars icon indicating copy to clipboard operation
mars copied to clipboard

Improve performance of building expr, subtask graph etc

Open qinxuye opened this issue 2 years ago • 0 comments

Motivation

Now, when tileables have large number of chunks, building expr, subtask graph and so forth could become the bottleneck, we need to try to reduce the overhead.

Solutions

Support generating dtypes, index_value, columns_value lazily for DataFrame chunks

Now, generating meta of DataFrame chunks will occupy quite a non-negligible time, since the tileable have some important meta like dtypes, nsplits, index_value etc, we can actually calculate the chunk's meta, we don't have to generate the chunk meta intermediately unless we need them.

So we can add support for generating those meta lazily unless they are used. This job is POC in #2756 .

Stop including dtypes, index_value etc in chunks of subtasks

When subtask graph is generated, meta of chunks will be include, we can skip those parts in subtask graph, hence the cost of serialization will be expected to reduce. However, for now, not a large number of ops are using these meta in their execution process, so we may need some work to correct them.

Reduce cost of updating chunk meta

Updating chunk meta may be another bottleneck that can be reduced.

Work items

  • [x] Introduce asv benchmarks(#2798)
  • [x] Support generating dtypes, index_value, columns_value lazily for DataFrame chunks(#2756)
  • [x] ~~Stop including dtypes, index_value etc in chunks of subtasks(#2923)~~
  • [x] Reduce cost of updating chunk meta(#2912)

qinxuye avatar Mar 07 '22 06:03 qinxuye