
Support for `tf.feature_column`s in DISC

Orion34-lanbo opened this issue 2 years ago • 6 comments

Why we want to support tf.feature_columns

We have found tf.feature_column ops in many industrial models, such as CTR models. Normally, tf.feature_column ops are not computation intensive but are memory intensive, and they also consume a lot of CPU resources due to the runtime costs these kinds of ops bring.

tf.feature_column ops are observed to take a large portion of end-to-end time and CPU resources. Currently we have a pattern-match based implementation, with custom passes and ops, to optimize tf.feature_column ops. However, these patterns need frequent updates because every user combines tf.feature_column ops in their own custom way. Supporting tf.feature_column on the compiler side will therefore let us handle a large variety of users' customized tf.feature_column combinations.

Long-term Road Map

  • Support lowering of all tf.feature_column related ops
  • Support clustering of all tf.feature_column related ops
  • Based on the performance of the code-generated tf.feature_column kernels, apply targeted optimizations such as AVX-512.

Short-term Road Map

Support the following two feature columns first. Update 20220830: since there are major changes to the original plan, some tasks are disabled for now.

- [ ] Support `bucketized_column`
- [ ] Support [indicator_column](https://www.tensorflow.org/api_docs/python/tf/feature_column/indicator_column) (see the sketch after this list)
- [ ] Support clustering for the ops related to these two feature column types
- [ ] Profile to find potential performance issues
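For reference, a minimal sketch of the two target columns; the column names, boundaries, and input data are made up for illustration:

```python
import tensorflow as tf

# bucketized_column: discretizes a numeric feature into one-hot buckets.
price = tf.feature_column.numeric_column('price')
price_buckets = tf.feature_column.bucketized_column(
    price, boundaries=[10.0, 20.0, 30.0])

# indicator_column: one-hot encodes a categorical feature.
category = tf.feature_column.categorical_column_with_identity(
    'category', num_buckets=4)
category_onehot = tf.feature_column.indicator_column(category)

features = {
    'price': tf.constant([[5.0], [25.0]]),
    'category': tf.constant([[1], [3]]),
}
dense = tf.keras.layers.DenseFeatures([price_buckets, category_onehot])(features)
print(dense)  # shape [2, 8]: 4 bucket indicators and 4 one-hot entries per row
```

The resulting graph expands into many small ops (Bucketize, one-hot expansion, concat, ...), which is why these columns are memory-bound rather than compute-bound.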

Orion34-lanbo avatar Jul 29 '22 08:07 Orion34-lanbo

Update 20220819: For the SparseToDense op, we are able to lower it to mhlo operations. After lowering, we found that DISC does not support codegen for mhlo::ScatterOp. Adding full support for scatter will take a considerable amount of time, so we will put this work on hold and continue with supporting tf.embedding_column.
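As a reference for why scatter support is the blocker, here is a plain-Python sketch of SparseToDense semantics (illustrative only, not the DISC lowering):

```python
import numpy as np

def sparse_to_dense(indices, dense_shape, values, default_value=0):
    """Write each sparse value into a default-filled buffer: one
    scatter-write per nonzero, which is mhlo::ScatterOp's access pattern."""
    out = np.full(dense_shape, default_value, dtype=np.asarray(values).dtype)
    for idx, v in zip(indices, values):
        out[tuple(idx)] = v
    return out

print(sparse_to_dense([[0, 1], [2, 0]], (3, 2), [7, 9]))
# [[0 7]
#  [0 0]
#  [9 0]]
```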

Orion34-lanbo avatar Aug 19 '22 08:08 Orion34-lanbo

For the Unique op in tf.embedding_column, we have noticed that grappler's ArithmeticOptimizer simplifies the unique + gather + sparsesegmentxxx pattern to a single sparsesegmentxxx. We need to confirm whether we still need to support Unique for tf.embedding_column; see the sketch below.
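A small sketch of the equivalence behind that grappler rewrite (illustrative; the actual optimizer rewrites the GraphDef):

```python
import tensorflow as tf

params = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
ids = tf.constant([2, 0, 2])
segment_ids = tf.constant([0, 0, 1])

# Pattern as emitted for embedding lookups: deduplicate, gather, reduce.
unique_ids, idx = tf.unique(ids)
gathered = tf.gather(params, unique_ids)
a = tf.sparse.segment_sum(gathered, idx, segment_ids)

# Simplified form: sparse_segment_sum already gathers rows internally.
b = tf.sparse.segment_sum(params, ids, segment_ids)

assert bool(tf.reduce_all(tf.equal(a, b)))
```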

Orion34-lanbo avatar Aug 19 '22 08:08 Orion34-lanbo

Update 20220830: We have decided to support lowering for the tf.embedding_column related ops by directly emitting code for the following ops, making DISC support tf.embedding_column:

  • tf.SparseReshape
  • tf.SparseFillEmptyRows
  • tf.SparseSegmentMean
  • tf.Where

Following pengzhan's original PoC implementation, we will lower these 4 ops directly to the corresponding ops in the mhlo_disc and lmhlo_disc dialects, and do codegen for these ops in the DiscLhloLegalizeRootsToParallelLoops pass.
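For context, a reference-semantics sketch of one of these ops, tf.SparseFillEmptyRows, in plain Python (the real op keeps outputs in row-major order; this sketch appends filled rows at the end for brevity):

```python
import numpy as np

def sparse_fill_empty_rows(indices, values, dense_shape, default_value):
    """Every row of the sparse tensor with no entry gets a single
    entry (row, 0) holding default_value."""
    num_rows = dense_shape[0]
    has_entry = np.zeros(num_rows, dtype=bool)
    for row, _ in indices:
        has_entry[row] = True
    out_indices = [list(i) for i in indices]
    out_values = list(values)
    for row in range(num_rows):
        if not has_entry[row]:          # fill the empty row
            out_indices.append([row, 0])
            out_values.append(default_value)
    return out_indices, out_values, ~has_entry  # indicator of filled rows

print(sparse_fill_empty_rows([[0, 1], [2, 0]], [7, 9], (4, 2), 0))
# rows 1 and 3 are empty and get filled with the default value
```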

Latest TODO list for supporting tf.embedding_column on X86 device:

  • [x] codegen support for tf.SparseReshape (still needs a patch to support -1 in new_shape; see the sketch after this list)
  • [x] codegen support for tf.SparseFillEmptyRows
  • [x] codegen support for tf.SparseSegmentMean (needs some performance improvement)
  • [x] codegen support for tf.Where
  • [x] Support clustering for these 4 ops
  • [ ] Benchmark for popular models with feature columns
  • [ ] support multi-threading code generation for these ops
  • [ ] support fusion for these ops
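A sketch of the tf.SparseReshape semantics, including the -1 inference mentioned in the first item above (illustrative, not the generated code):

```python
import numpy as np

def sparse_reshape(indices, old_shape, new_shape):
    """Linearize each index against the old shape, de-linearize against
    the new one; a single -1 dimension is inferred so the total element
    count is preserved."""
    new_shape = list(new_shape)
    if -1 in new_shape:
        known = int(np.prod([d for d in new_shape if d != -1]))
        new_shape[new_shape.index(-1)] = int(np.prod(old_shape)) // known
    flat = np.ravel_multi_index(np.asarray(indices).T, old_shape)
    new_indices = np.stack(np.unravel_index(flat, new_shape), axis=1)
    return new_indices, new_shape

idx, shape = sparse_reshape([[0, 1], [1, 2]], (2, 3), (3, -1))
# idx == [[0, 1], [2, 1]], shape == [3, 2]
```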

Orion34-lanbo avatar Aug 30 '22 06:08 Orion34-lanbo

Update 20221205: After initial tests on EasyRec models, we outperform the TensorFlow baseline; however, when comparing codegen performance with hand-written kernels, some gap is inevitable. Thus, we need to update the ongoing items as follows:

  • [x] replace scf.for with scf.parallel in the current codegen implementation
  • [x] support input fusion of element-wise ops into mhlo_disc.where (see the sketch after this list)
  • [x] support output fusion for mhlo_disc.where
  • [x] simplify mhlo.sparse_reshape so that it may work as a normal reshape op
  • [ ] use 256-bit or 512-bit AVX to speed up kernel performance
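A sketch of the input-fusion idea for mhlo_disc.where: the element-wise producer is evaluated inside the index-collection loop instead of materializing the predicate tensor first. This is a plain-Python illustration of the intended loop structure; the names here are mine, not DISC's:

```python
import numpy as np

def fused_where(x, producer=lambda v: v > 0.5):
    """Collect indices where producer(x) holds; the element-wise producer
    runs inline, so no intermediate predicate buffer is written."""
    out = []
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if producer(x[i, j]):
                out.append((i, j))
    return np.asarray(out, dtype=np.int64)

print(fused_where(np.array([[0.1, 0.9], [0.7, 0.2]])))
# [[0 1]
#  [1 0]]
```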

Orion34-lanbo avatar Dec 05 '22 09:12 Orion34-lanbo

Update 20221223: Recently we have done a PoC of output fusion for the lmhlo_disc.where op. The entire PoC consists of several parts of code changes, which we intend to split into the following commits to the main branch:

  • [ ] sparse op rewrite pass part 1: eliminate unnecessary mhlo.sparse_reshape for 2-D sparse tensor reshapes
  • [ ] sparse op rewrite pass part 2: rewrite mhlo.real_dynamic_slice + mhlo.gather to mhlo.gather + mhlo.real_dynamic_slice for the mhlo.real_dynamic_slice coming from the `where` op
  • [ ] refactor sparse op fusion into a SparseOpCpuFusionStrategy to support output fusion
  • [ ] output inline fusion pass part 1: inline fuse lmhlo.dynamic_reshape
  • [ ] output inline fusion pass part 2: inline fuse lmhlo.dynamic_gather
  • [ ] other small changes to support the e2e flow of lmhlo_disc.where op's output fusion

Orion34-lanbo avatar Dec 23 '22 06:12 Orion34-lanbo

Update 20230302: Output fusion for lmhlo_disc.where did not bring enough performance improvement on the EasyRec model. After profiling and detailed analysis, we found that lmhlo_disc.sparse_segment_reduction carries a large share of the computation for embedding_column. We have done a series of optimizations since then.

latest perf result

Performance was tested on a bare-metal server used exclusively by me, with 128 × Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz:

| opt | latency (ms) | speed-up |
| --- | --- | --- |
| baseline | 9.77 | - |
| hand-write-fusion-opt | 6.79 | 1.43x |
| disc | 8.05 | 1.21x |

We have achieved a 1.21x speed-up over the TF baseline; however, we still have an 18.5% gap compared to hand-write-fusion-opt.

optimization pocs

  • sparse segment reduction refactor
  • rewrite sparse_segment_reduction + sparse_fill_empty_rows to sparse_segment_reduction_with_empty_rows (only valid for the inference case; see the sketch after this list)
  • output fusion for sparse_segment_reduction_with_empty_rows with fusion type kSparseReduction
  • schedule optimization for sparse_segment_reduction for possible output fusions
  • output inline fusion for kSparseReduction
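A sketch of the intended semantics of the fused op. The name sparse_segment_reduction_with_empty_rows comes from the list above, but the details here (mean variant, default row for empty segments) are my assumption: instead of materializing filled-in rows, the fused reduction writes the default result directly for every empty segment.

```python
import numpy as np

def sparse_segment_mean_with_empty_rows(data, indices, segment_ids,
                                        num_segments, default_row):
    # Accumulate per-segment sums and counts in a single pass.
    sums = np.zeros((num_segments, data.shape[1]), dtype=data.dtype)
    counts = np.zeros(num_segments, dtype=np.int64)
    for idx, seg in zip(indices, segment_ids):
        sums[seg] += data[idx]
        counts[seg] += 1
    # Empty segments get the default row directly, with no fill pass.
    out = np.tile(default_row, (num_segments, 1)).astype(data.dtype)
    nonempty = counts > 0
    out[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return out

data = np.array([[1.0, 2.0], [3.0, 4.0]])
print(sparse_segment_mean_with_empty_rows(
    data, indices=[0, 1], segment_ids=[0, 0],
    num_segments=3, default_row=np.zeros(2)))
# segment 0: mean of both rows -> [2. 3.]; segments 1 and 2 are empty -> defaults
```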

items for code merge pr

  • [x] sparse fusion refactor
  • [x] add mhlo_disc.sparse_segment_reduction to support both tf.sparse_segment_mean and tf.sparse_segment_sum
  • [x] sparse reshape elimination using pdll
  • [x] sparse_segment_reduction + sparse_fill_empty_rows rewrite to sparse_segment_reduction_with_empty_rows using pdll
  • [x] codegen logic for lmhlo_disc.sparse_segment_reduction_with_empty_rows as root node
  • [ ] output fusion for kSparseReduction
  • [ ] basic framework for output inline fusion with utils: `collectStoreToSameMemref`, `getStorableOperation`, `inlineFuseLhloOp`
  • [ ] output inline fusion support for lmhlo.dynamic_reshape
  • [ ] output inline fusion support for lmhlo.dynamic_broadcast_in_dim
  • [ ] output inline fusion support for lmhlo.select

Orion34-lanbo avatar Mar 02 '23 09:03 Orion34-lanbo