DeepRec
DeepRec is a high-performance deep learning framework for recommendation, based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.
export TF_USE_CUBLASLT=1 got
**System information** - DeepRec version (you are using): 1.15.5+deeprec2208 - Are you willing to contribute it (Yes/No): Yes **Describe the feature and the current behavior/state.** how does DeepRec analyze the sparse...
**System information** - OS Platform and Distribution (e.g., Linux Ubuntu 16.04): - DeepRec version or commit id: 1.15.5+deeprec2208 - Python version: python3.6 - Bazel version (if compiling from source): bazel...
**System information** - OS Platform and Distribution (e.g., Linux Ubuntu 20.04): Ubuntu 20.04 - DeepRec version or commit id: 23252970 - Python version: python3.6.9 - Bazel version (if compiling from...
### CPU usage of the ps during training  ### CPU usage of the chief during training (workers are similar)  ### Training batches per second during training  With TF 1.15, running distributed training with 1 chief, 1 ps, and 4 workers, the CPU usage of the ps keeps growing during training, while the CPU usage of the chief and workers drops later on and the number of training batches per second also decreases. What could be the reason for this?
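For reference, the 1 chief / 1 ps / 4 worker layout described in this issue corresponds to a TF 1.x cluster spec along the lines of the sketch below; the host names and ports are hypothetical placeholders.

```python
# Minimal sketch of the reported cluster layout (1 chief, 1 ps, 4 workers).
# Host names and ports are hypothetical placeholders; each process sets
# TF_CONFIG with its own task type/index before building the graph.
import json
import os

cluster = {
    "chief":  ["chief-0:2222"],
    "ps":     ["ps-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222",
               "worker-2:2222", "worker-3:2222"],
}

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},  # e.g. the first worker process
})
```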
Does Evict support the scatter_update operation? ev_opt = tf.EmbeddingVariableOption(init_option=init_opt, filter_option=filter_opt, evict_option=evict_opt) emb_table = tf.get_embedding_variable("ev_emb_table", embedding_dim=64, partitioner=tf.fixed_size_partitioner(num_shards=10), ev_option=ev_opt) Can emb_table support scatter_update the same way an ordinary embedding_table does?
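A minimal sketch of the construction quoted above, assuming the tf.CounterFilter and tf.GlobalStepEvict option classes described in the DeepRec EmbeddingVariable documentation (the init_option from the issue is omitted here); verify the exact names against your DeepRec version.

```python
import tensorflow as tf

# Assumed DeepRec option classes: admit a feature after 3 occurrences,
# evict features not updated for 4000 global steps.
filter_opt = tf.CounterFilter(filter_freq=3)
evict_opt = tf.GlobalStepEvict(steps_to_live=4000)

ev_opt = tf.EmbeddingVariableOption(filter_option=filter_opt,
                                    evict_option=evict_opt)

emb_table = tf.get_embedding_variable(
    "ev_emb_table",
    embedding_dim=64,
    partitioner=tf.fixed_size_partitioner(num_shards=10),
    ev_option=ev_opt)

# Reads go through tf.nn.embedding_lookup; whether tf.scatter_update works on
# an EmbeddingVariable the way it does on a dense variable is the question asked.
ids = tf.constant([1, 2, 3], dtype=tf.int64)
emb = tf.nn.embedding_lookup(emb_table, ids)
```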
**System information** - DeepRec version (you are using): - Are you willing to contribute it (Yes/No): Yes **Describe the feature and the current behavior/state.** Recommendation workloads usually need to replay many days of historical data for eval/train streaming training. Could you provide a demo of multi-day streaming training? For example, something like this: https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/online_trainer.md **Will this change...
Hello, I ran into an issue while debugging and have a few questions; I would appreciate your help if convenient, thanks. When testing KafkaDataset, a very large number of threads are created, but only a small fraction of them show any utilization. The thread count grows in proportion to the number of partitions. (1) On a single machine with 200 partitions, more than 15,000 threads can be created (the dataset parallelism is set to 3, inter_op and intra_op are left at their defaults; the results are the same on 128-core, 64-core, and 32-core machines, with the same number of threads). (2) In a distributed setup with 2 workers, each assigned 100 partitions, the per-worker thread count is halved (7-8k each) and the total stays the same. Likewise, extending to 4...
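For reference, the inter_op / intra_op pools mentioned above are the TF 1.15 session-level thread pools and can be pinned explicitly rather than left at their defaults; note that this bounds only the generic compute pools, not any threads the Kafka reader may create per partition. The values below are illustrative.

```python
import tensorflow as tf

# Explicitly bound the session-level thread pools instead of using the defaults.
config = tf.ConfigProto(
    inter_op_parallelism_threads=8,   # number of ops that may run concurrently
    intra_op_parallelism_threads=8,   # threads used inside a single op
)

with tf.Session(config=config) as sess:
    # Build and run the KafkaDataset input pipeline / training graph here.
    pass
```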