Chevolier comments

Results: 16 comments of Chevolier

Trying to execute run.py in train folder renders an error

I hit the same issue: memory usage keeps growing to 256GB during data loading until the process is killed by the system. Is there any solution?

Trying to execute run.py in train folder renders an error

> I hit the same issue: memory usage keeps growing to 256GB during data loading until the process is killed by the system. Is there any solution? Updates:...

[BUG] AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' while running llama7b for step3/rlhf/ppo

Any solutions? I encountered the same issue with the bloomz model. So far I have just removed --enable_hybrid_engine to bypass the issue, and the program runs. However, I expect this reduces efficiency.

Does this program support TensorBoard?

> @Chevolier, can you please clarify the program you are referring to? It would be helpful to share what you are running and the expected output. Thanks! I mean the...

Batching not working : QPS remains same on increasing batch size

Same problem: dynamic batching does not work. Environment: Docker image nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3, tensorrt_llm==0.7.1, tensorrtllm_backend==0.7.1. Is there any way to solve this problem?

Batching not working : QPS remains same on increasing batch size

> Same problem: dynamic batching does not work. Environment: Docker image nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3, tensorrt_llm==0.7.1, tensorrtllm_backend==0.7.1. Is there any way to solve this problem? The following steps helped solve the problem in the above...

Chevolier

Trying to execute run.py in train folder renders an error

Trying to execute run.py in train folder renders an error

[BUG] AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' while running llama7b for step3/rlhf/ppo

Does this program support TensorBoard?

Batching not working : QPS remains same on increasing batch size

Batching not working : QPS remains same on increasing batch size

Last token repeats after adding end_id for Baichuan2-13B-Chat

[QST] Installing RAPIDS 24.06 not recognizing UCX version

After starting the webui, clicking any button reports a timeout error

[Bug] sglang doesn't stop the generation when the request is canceled