HeKa

Results 17 comments of HeKa

Hi @sivukhin, because of TF's resource locking, MirroredStrategy is not efficient for TFRA multi-table training. We recommend using Horovod for distributed training:
- https://github.com/tensorflow/recommenders-addons/blob/master/docs/api_docs/tfra/dynamic_embedding/keras/layers/HvdAllToAllEmbedding.md
- https://github.com/tensorflow/recommenders-addons/blob/6f7bbb86a03bf17ee7a8c4b8d36415a2ca1cf693/tensorflow_recommenders_addons/dynamic_embedding/python/keras/layers/embedding.py#L528
- https://github.com/tensorflow/recommenders-addons/blob/master/demo/dynamic_embedding/movielens-1m-keras-with-horovod/movielens-1m-keras-with-horovod.py

Or you could have...

Of course HvdAllToAllEmbedding supports training on CPU. I ran your code successfully with `CUDA_VISIBLE_DEVICES=-1 horovodrun -np 2 python hvd_two_tower_test.py`, using both redis_creator and cuckoo_creator. Also, if the error that...

### Ring-AllReduce vs Parameter Server

The lower communication overhead of the multi-worker strategy relies on synchronous training. If many CPU nodes are trained asynchronously with a small batch size,...
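To illustrate the communication pattern behind that trade-off, here is a pure-Python simulation of ring all-reduce (a sketch for intuition only, not Horovod's actual NCCL-based implementation). Each of N workers exchanges only 1/N-sized chunks with its neighbor over 2·(N-1) steps, so per-worker traffic stays roughly constant as N grows, unlike a parameter server that must receive every worker's full gradient:

```python
def ring_allreduce(grads):
    """Simulate synchronous ring all-reduce.

    grads: list of equal-length gradient vectors, one per worker.
    Returns one fully summed vector per worker after a reduce-scatter
    phase followed by an all-gather phase (2*(n-1) steps total).
    """
    n = len(grads)
    length = len(grads[0])
    chunks = [list(v) for v in grads]          # working copy per worker
    bounds = [(i * length) // n for i in range(n + 1)]  # n chunk boundaries

    # Reduce-scatter: in step s, worker r sends chunk (r - s) mod n to
    # worker (r + 1) mod n, which adds it into its own copy. After n-1
    # steps, worker r holds the complete sum for chunk (r + 1) mod n.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - s) % n
            sends.append((r, c, chunks[r][bounds[c]:bounds[c + 1]]))
        for r, c, data in sends:               # apply simultaneously
            dst = (r + 1) % n
            for i, v in enumerate(data):
                chunks[dst][bounds[c] + i] += v

    # All-gather: circulate the completed chunks around the ring so
    # every worker ends up with the full summed gradient.
    for s in range(n - 1):
        moves = []
        for r in range(n):
            c = (r + 1 - s) % n
            moves.append(((r + 1) % n, c, chunks[r][bounds[c]:bounds[c + 1]]))
        for dst, c, data in moves:
            chunks[dst][bounds[c]:bounds[c + 1]] = data

    return chunks
```

With 3 workers holding `[1..6]`, `[10..60]`, `[100..600]`, every worker ends with `[111, 222, 333, 444, 555, 666]` while only ever sending one chunk per step.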

@sivukhin For now, it will continue to integrate with and remain compatible with the latest versions of TensorFlow, but this is a lot of work. So it would be great if...

> Try this: https://github.com/tensorflow/recommenders-addons/blob/master/docs/api_docs/tfra/dynamic_embedding/FileSystemSaver.md

```python
model = build_model(xxx)
de.enable_inference_mode()
model.save(export_dir)
```

`enable_inference_mode` changes the graph-building logic inside TFRA. It eliminates two memory copies in TrainableWrapper, which are...

@gautam20197 As far as I know, flash attention has already been implemented by NVIDIA in TensorFlow, right? [cuda_dnn.cc](https://github.com/tensorflow/tensorflow/blob/da22a881a3d24fd4f357207034ba6c596aa414d0/tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc)

@Cjkkkk So if I understand correctly, in addition to TF/JAX, PyTorch can also use OpenXLA to work with cuDNN.

Is there any benchmark comparing cuDNN fused attention and flash attention? Recently I found that TorchACC already supports cuDNN fused attention in PyTorch training. So there's definitely a benchmark,...
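Both kernels compute the same scaled dot-product attention, so any benchmark also needs a correctness baseline. Below is a minimal NumPy reference (illustrative only, not either library's implementation) that materializes the full (seq, seq) score matrix — exactly the memory traffic that fused and flash kernels avoid:

```python
import numpy as np

def attention_reference(q, k, v):
    """Naive scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    q, k, v: arrays of shape (batch, seq, d). Builds the full
    (batch, seq, seq) score matrix, which is the O(seq^2) memory
    cost that FlashAttention-style kernels eliminate by tiling.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (batch, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ v                               # (batch, seq, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8, 4))
k = rng.standard_normal((2, 8, 4))
v = rng.standard_normal((2, 8, 4))
out = attention_reference(q, k, v)
```

A fused-kernel benchmark would compare its output against this baseline (within floating-point tolerance) while timing only the kernel call.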