# Support for `tf.feature_column`s in DISC
## Why we want to support `tf.feature_column`s
We have found `tf.feature_column` ops in plenty of industrial models, such as CTR models. Normally, `tf.feature_column` ops are not computation intensive but are memory intensive, and they also consume a lot of CPU resources due to the runtime costs these kinds of ops bring. `tf.feature_column`s are observed to take a large portion of end-to-end time and CPU resources. Currently we have a pattern-match based implementation with custom passes and ops to optimize `tf.feature_column` ops. However, these patterns need to be updated frequently because each user has their own custom implementation when combining `tf.feature_column` ops. Supporting `tf.feature_column`s on the compiler side will give us the ability to handle a large variety of users' customized `tf.feature_column` combinations.
## Long-term Road Map
- Support lowering of all `tf.feature_column` related ops
- Support clustering of all `tf.feature_column` related ops
- Based on the perf of the codegen'ed `tf.feature_column` related kernels, do some specific optimization with `avx512` or the like
## Short-term Road Map
Support the following 2 feature columns first (a usage sketch follows the list).

20220830: Since we have made major changes to the original plans, some tasks are disabled for now.

- [ ] Support `bucketized_column`
- [ ] Support [indicator_column](https://www.tensorflow.org/api_docs/python/tf/feature_column/indicator_column)
- [ ] Support clustering for the ops related to these 2 types of feature columns
- [ ] Profiling to find potential perf issues
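For reference, a minimal sketch of what these two columns look like in user code (the feature names and values here are made up; this is the classic `tf.feature_column` API commonly seen in CTR models):

```python
import tensorflow as tf

# Hypothetical features: a numeric 'price' bucketized into ranges, and a
# small categorical 'cat' one-hot encoded via indicator_column.
price = tf.feature_column.numeric_column('price')
bucketized = tf.feature_column.bucketized_column(price, boundaries=[10.0, 100.0])

cat = tf.feature_column.categorical_column_with_identity('cat', num_buckets=5)
onehot = tf.feature_column.indicator_column(cat)

layer = tf.keras.layers.DenseFeatures([bucketized, onehot])
out = layer({'price': tf.constant([[5.0], [50.0]]),
             'cat': tf.constant([[1], [3]])})
```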
## Update 20220819
For the `SparseToDense` op, we are able to lower it to `mhlo` operations. After lowering, we found that DISC does not support codegen for `mhlo::ScatterOp`. Adding full support for scatter will take a considerable amount of time, so we will put this work on hold and continue with supporting `tf.embedding_column`.
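A minimal sketch of why `SparseToDense` ends up needing scatter (plain numpy, just to illustrate the semantics):

```python
import numpy as np

# SparseToDense semantics: fill a buffer with the default value, then
# scatter the sparse values into it -- the scatter is exactly the part
# that needs mhlo::ScatterOp codegen after lowering.
indices = np.array([[0, 1], [2, 3]])   # sparse coordinates
values = np.array([7.0, 9.0])
default_value = 0.0

dense = np.full((3, 4), default_value)
dense[tuple(indices.T)] = values        # the scatter step
```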
For the `Unique` op in `tf.embedding_column`, we have noticed that grappler's `ArithmeticOptimizer` simplifies the `unique + gather + sparse_segment_xxx` pattern to a plain `sparse_segment_xxx`. We need to confirm whether we still need to support `Unique` for `tf.embedding_column`.
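A small example of the equivalence behind that rewrite, using `sparse_segment_sum` as a concrete `sparse_segment_xxx` (the values are made up):

```python
import tensorflow as tf

params = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
indices = tf.constant([0, 2, 2, 1])
segment_ids = tf.constant([0, 0, 1, 1])

# The unique + gather + sparse_segment_sum pattern ...
unique_vals, unique_pos = tf.unique(indices)
deduped = tf.sparse.segment_sum(tf.gather(params, unique_vals),
                                unique_pos, segment_ids)

# ... computes the same result as sparse_segment_sum alone, which is what
# ArithmeticOptimizer rewrites it to.
direct = tf.sparse.segment_sum(params, indices, segment_ids)
tf.debugging.assert_near(deduped, direct)
```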
## Update 20220830
We have decided to support lowering for `tf.embedding_column` related ops by directly emitting code for the following ops:

- `tf.SparseReshape`
- `tf.SparseFillEmptyRows`
- `tf.SparseSegmentMean`
- `tf.Where`

Following pengzhan's original PoC implementation, we will lower these 4 ops directly to the corresponding ops in the `mhlo_disc` and `lmhlo_disc` dialects, and will do codegen for them in the `DiscLhloLegalizeRootsToParallelLoops` pass.
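For orientation, a hedged sketch of how an `embedding_column` lookup (combiner `'mean'`) roughly decomposes into exactly these four ops; the real `safe_embedding_lookup_sparse` handles more corner cases:

```python
import tensorflow as tf

params = tf.random.normal([100, 8])                  # embedding table
sp = tf.sparse.SparseTensor(indices=[[0, 0], [0, 1], [2, 0]],
                            values=tf.constant([3, 7, 42], tf.int64),
                            dense_shape=[3, 2])      # batch row 1 is empty

sp = tf.sparse.reshape(sp, [-1, 2])                  # tf.SparseReshape
sp, empty = tf.sparse.fill_empty_rows(sp, 0)         # tf.SparseFillEmptyRows
embs = tf.sparse.segment_mean(params, sp.values,
                              sp.indices[:, 0])      # tf.SparseSegmentMean
empty_rows = tf.where(empty)                         # tf.Where
```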
Latest TODO list for supporting `tf.embedding_column` on X86 devices:

- [x] codegen support for `tf.SparseReshape` (still needs a patch to support `-1` in `new_shape`; see the example after this list)
- [x] codegen support for `tf.SparseFillEmptyRows`
- [x] codegen support for `tf.SparseSegmentMean` (needs some perf improvement)
- [x] codegen support for `tf.Where`
- [x] Support clustering for these 4 ops
- [ ] Benchmark popular models with feature columns
- [ ] Support multi-threaded code generation for these ops
- [ ] Support fusion for these ops
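The `-1` case mentioned above follows the dense-reshape convention: `-1` means "infer this dimension from the total number of elements":

```python
import tensorflow as tf

# Reshaping a [3, 2] sparse tensor to [-1]: the 6 is inferred. This is
# the case the SparseReshape codegen patch still needs to cover.
sp = tf.sparse.SparseTensor([[0, 0], [2, 1]], [1.0, 2.0], [3, 2])
flat = tf.sparse.reshape(sp, [-1])      # dense_shape becomes [6]
```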
## Update 20221205
After initial tests on EasyRec models, we out-perform the TensorFlow baseline; however, comparing codegen perf with hand-written kernels, a noticeable gap remains. Thus, we update the ongoing items as follows:
- [x] Replace `scf.for` with `scf.parallel` in the current codegen impl
- [x] Support input fusion of element-wise ops with `mhlo_disc.where` (see the sketch after this list)
- [x] Support output fusion for `mhlo_disc.where`
- [x] Simplify `mhlo.sparse_reshape`, possibly treating it as a normal `reshape` op
- [ ] Use `avx256` or `avx512` to speed up kernel perf
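The graph shape those two `where` fusions target looks roughly like this (a made-up example):

```python
import tensorflow as tf

x = tf.random.normal([1024])
idx = tf.where(tf.abs(x) > 1.0)     # element-wise ops feeding where: input fusion
vals = tf.gather(x, idx[:, 0])      # gather consuming where's indices: output fusion
```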
## Update 20221223
Recently, we finished a PoC of output fusion for the `lmhlo_disc.where` op. The PoC consists of several code changes that we intend to split into the following pieces for committing to the main branch.
- [ ] Sparse op rewrite pass, part 1: eliminate unnecessary `mhlo.sparse_reshape` for 2-D sparse tensor reshapes
- [ ] Sparse op rewrite pass, part 2: rewrite `mhlo.real_dynamic_slice + mhlo.gather` to `mhlo.gather + mhlo.real_dynamic_slice` when the `mhlo.real_dynamic_slice` comes from a `where` op (see the sketch after this list)
- [ ] Refactor sparse op fusion into a `SparseOpCpuFusionStrategy` that supports output fusion
- [ ] Output inline fusion pass, part 1: inline fuse `lmhlo.dynamic_reshape`
- [ ] Output inline fusion pass, part 2: inline fuse `lmhlo.dynamic_gather`
- [ ] Other small changes to support the e2e flow of the `lmhlo_disc.where` op's output fusion
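The slice/gather swap in part 2 presumably relies on the identity that slicing the gather indices commutes with slicing the gather result along the batch dimension; a numpy check of that identity:

```python
import numpy as np

params = np.arange(20).reshape(10, 2)
idx = np.array([0, 4, 7, 9])
start, end = 1, 3

# slice-then-gather == gather-then-slice (along the index/batch dim)
np.testing.assert_array_equal(params[idx[start:end]],
                              params[idx][start:end])
```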
## Update 20230302
Output fusion for `lmhlo_disc.where` did not bring enough perf improvement on the EasyRec model. After profiling and detailed analysis, we found that `lmhlo_disc.sparse_segment_reduction` carries a large share of the computation for `embedding_column`. We have done a series of optimizations since then.
### Latest perf result
Perf is tested on a bare-metal server used only by myself, with 128 cores of Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz.

| opt | latency (ms) | speed-up |
|---|---|---|
| baseline | 9.77 | - |
| hand-write-fusion-opt | 6.79 | 1.43x |
| disc | 8.05 | 1.21x |
We have achieved a 1.21x speed-up over TF's baseline; however, there is still an 18.5% gap to `hand-write-fusion-opt`.
### Optimization PoCs
- Sparse segment reduction refactor
- Rewrite `sparse_segment_reduction` + `sparse_fill_empty_rows` to `sparse_segment_reduction_with_empty_rows` (only works for the inference case; see the sketch after this list)
- Output fusion for `sparse_segment_reduction_with_empty_rows` with fusion type `kSparseReduction`
- Schedule optimization for `sparse_segment_reduction` to enable possible output fusions
- Output inline fusion for `kSparseReduction`
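A hypothetical numpy reference for the fused op's semantics (the name and signature here are assumptions, not the actual DISC implementation):

```python
import numpy as np

def sparse_segment_mean_with_empty_rows(params, ids, rows, num_rows):
    """Mean-reduce params[ids] per output row, emitting zeros for rows
    that have no ids -- instead of first materializing filled rows as
    sparse_fill_empty_rows would (an inference-only shortcut)."""
    out = np.zeros((num_rows, params.shape[1]), dtype=params.dtype)
    counts = np.zeros(num_rows)
    for i, r in zip(ids, rows):
        out[r] += params[i]
        counts[r] += 1
    nonempty = counts > 0
    out[nonempty] /= counts[nonempty, None]
    return out
```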
### Items for code-merge PRs
- [x] Sparse fusion refactor
- [x] Add `mhlo_disc.sparse_segment_reduction` to support both `tf.sparse_segment_mean` and `tf.sparse_segment_sum`
- [x] Sparse reshape elimination using PDLL
- [x] Rewrite `sparse_segment_reduction` + `sparse_fill_empty_rows` to `sparse_segment_reduction_with_empty_rows` using PDLL
- [x] Codegen logic for `lmhlo_disc.sparse_segment_reduction_with_empty_rows` as the root node
- [ ] Output fusion for `kSparseReduction`
- [ ] Basic framework for output inline fusion, with utils: `collectStoreToSameMemref`, `getStorableOperation`, `inlineFuseLhloOp`
- [ ] Output inline fusion support for `lmhlo.dynamic_reshape`
- [ ] Output inline fusion support for `lmhlo.dynamic_broadcast_in_dim`
- [ ] Output inline fusion support for `lmhlo.select`