[PLAN] Improve performance with dimension compaction and indexer
- Stop using ndloop and compute each operation with a single CUDA kernel using an indexer
- Compact dimensions to make the indexer computation fast
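To illustrate the two items above, here is a minimal sketch in Python rather than CUDA. It is not cumo's actual implementation: the names `compact_dims` and `offset_of` are hypothetical, and strides are assumed to be in elements, not bytes. The idea is that each GPU thread would turn its flat index into a memory offset with something like `offset_of`, so merging dimensions beforehand cuts the number of div/mod operations per thread.

```python
def compact_dims(shape, strides):
    """Merge adjacent axes that can be walked with a single stride.

    Strides are in elements (an assumption made for this sketch).
    """
    out_shape, out_strides = [], []
    for size, stride in zip(shape, strides):
        if size == 1:
            continue  # a length-1 axis never contributes to the offset
        if out_shape and out_strides[-1] == size * stride:
            # One step on the previous axis spans this whole axis: fuse them.
            out_shape[-1] *= size
            out_strides[-1] = stride
        else:
            out_shape.append(size)
            out_strides.append(stride)
    return out_shape, out_strides


def offset_of(i, shape, strides):
    """Indexer: map a flat logical index i to a memory offset in elements."""
    offset = 0
    # Peel off coordinates from the innermost axis outward.
    for size, stride in zip(reversed(shape), reversed(strides)):
        i, coord = divmod(i, size)
        offset += coord * stride
    return offset


# A fully contiguous 2x3x4 array collapses to a single axis.
contiguous = compact_dims([2, 3, 4], [12, 4, 1])

# A 2x3x4 view whose outer stride is 24 keeps two axes after compaction.
view = compact_dims([2, 3, 4], [24, 4, 1])
```

With the contiguous array the indexer needs one div/mod per element instead of three; with the strided view it needs two.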
Element-wise binary ops are already done in https://github.com/sonots/cumo/pull/64.
However, reductions and other operations such as store_from are not done yet.
Without this, cumo (and red-chainer) cannot compete with cupy (and chainer).
Current performance comparison on a K80 machine:
- chainer mnist: 5 sec / epoch
- red-chainer mnist: 13 sec / epoch