He Jia

Results: 87 comments of He Jia

> > @alykhantejani Most TFRA users are using GPU sync training without PS, so few people are aware of this issue. If this issue occurs only in some of the...

> > > It seems that `user_embedding = de.keras.layers.SquashedEmbedding( user_embedding_size, initializer=embedding_initializer, devices=self.devices, name='user_embedding')` has some bugs. The embedding cannot identify which port the variables are on. > > >...

> > @alykhantejani Don't worry about the memory; the DE all-to-all embedding layer will shard the entire embedding across the worker ranks. You can also use a CPU embedding table, but DE HKV...

@alykhantejani Here is the demo: https://github.com/tensorflow/recommenders-addons/tree/master/demo/dynamic_embedding/movielens-1m-keras-with-horovod If you want to place the embedding in host memory, set devices=["CPU"] when you create the embedding layer (see the sketch below). If you want to use...
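A minimal sketch of that host-memory placement, based on the `SquashedEmbedding` constructor quoted earlier in the thread; the embedding size and initializer here are illustrative values, not taken from the issue:

```python
import tensorflow as tf
from tensorflow_recommenders_addons import dynamic_embedding as de

# Hypothetical example: pass devices=["CPU"] so the dynamic embedding
# table lives in host memory instead of GPU HBM.
user_embedding = de.keras.layers.SquashedEmbedding(
    32,  # embedding dimension (illustrative)
    initializer=tf.keras.initializers.RandomNormal(stddev=0.1),
    devices=["CPU"],  # place the table in host memory
    name='user_embedding')
```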

@beijinggao Do you have any questions? If not, I will close the issue.

_resource_handle is the handle of the TFRA table, and the saveable.op object should be a TFRA object rather than a const string. So you need to check whether the saveables returned in the code are valid; only valid TFRA saveables can run the code that follows.
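A rough illustration of that check, assuming `saveables` is the list returned by the save path in question; the filtering condition is an assumption, not the actual TFRA code:

```python
# Hypothetical validity check: keep only saveables whose `op` is a real
# TFRA table handle rather than a const string baked into the graph.
valid_saveables = []
for s in saveables:
    if isinstance(s.op, str):
        continue  # a const string slipped in; not a legal TFRA saveable
    valid_saveables.append(s)
```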

This is probably because freq_var is not tracked. For the time being, you can use the TF manual save/load APIs as a workaround (see the sketch below). @huangenyan Fixed in PR #415
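One way to do that manual save/restore, assuming `freq_var` is the untracked variable from the issue; the checkpoint path is illustrative:

```python
import tensorflow as tf

# Hypothetical workaround: checkpoint freq_var by hand since object
# tracking misses it; the path below is illustrative.
ckpt = tf.train.Checkpoint(freq_var=freq_var)
ckpt.write("/tmp/freq_var.ckpt")   # manual save
ckpt.read("/tmp/freq_var.ckpt")    # manual restore
```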

@mnicely I have tested cuDNN attention on an A30 with the image nvcr.io/nvidia/pytorch:24.04-py3; it is much slower than flash attention in the same image. =====================TEST CUDNN Attention===================== /workspace/qkv_attention.py:34: UserWarning: USING CUDNN SDPA...
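For reference, a micro-benchmark along these lines could look like the sketch below, assuming a PyTorch build that exposes `SDPBackend.CUDNN_ATTENTION`; the tensor shapes and iteration count are illustrative, not taken from the original qkv_attention.py:

```python
import time
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative Q/K/V: batch 8, 16 heads, 4096 tokens, head dim 64, fp16.
q, k, v = (torch.randn(8, 16, 4096, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

def bench(backend, iters=50):
    # Force a single SDPA backend, warm up, then time the average call.
    with sdpa_kernel(backend):
        F.scaled_dot_product_attention(q, k, v)  # warm-up
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

print("cudnn:", bench(SDPBackend.CUDNN_ATTENTION))
print("flash:", bench(SDPBackend.FLASH_ATTENTION))
```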

You can modify it directly; generally that won't cause many problems. For inference, though, vLLM's throughput is much higher than FasterTransformer and the like, so I'd recommend switching frameworks.