Lzhang-hub

Results 17 issues of Lzhang-hub

Through `ip:5678/metric`, I get the number of containers on each GPU card and the amount of resources remaining on each card. I find that some cards are oversold. container info on...
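The check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the card names, and the numbers are all hypothetical, and the real metric names and units exposed by the endpoint are not shown in the report.

```python
from collections import defaultdict


def find_oversold_cards(capacity_by_card, allocations):
    """Return {card: (requested, capacity)} for cards whose requests exceed capacity.

    capacity_by_card: dict of card name -> total allocatable units (hypothetical units).
    allocations: iterable of (card name, units requested by one container).
    """
    requested = defaultdict(int)
    for card, amount in allocations:
        requested[card] += amount
    return {
        card: (total, capacity_by_card.get(card, 0))
        for card, total in requested.items()
        if total > capacity_by_card.get(card, 0)
    }


# Made-up sample values, for illustration only.
capacity = {"gpu0": 100, "gpu1": 100}
containers = [("gpu0", 60), ("gpu0", 50), ("gpu1", 40)]
oversold = find_oversold_cards(capacity, containers)
print(oversold)
```

A card appears in the result only when the sum of its container requests is larger than its capacity, which is the situation the report calls oversold.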

run step:
```
# build docker image
./docker/build.sh cuda
# run docker
docker run -it --gpus=all ait:latest bash
# run scripts
cd /AITemplate/examples/05_stable_diffusion
python3 scripts/download_pipeline.py
python3 scripts/compile.py
```
error log:...

### System Info
GPU: rtx8000
Driver version: 525.85.05
CUDA version: 12.0
System: ubuntu20.04
### Who can help?
_No response_
### Information
- [ ] The official example scripts
- [...

bug

### Feature request
Support Volta GPUs
### Motivation
Support Volta GPUs
### Your contribution
....

Is there any plan to support Yi-VL? https://huggingface.co/01-ai/Yi-VL-34B

## Condition:
GPU: A100 40G * 8
batch size = 2
## Error
CUDA out of memory.
## Some confusion
We find that GPU 0 uses more memory than the other GPUs ![image](https://github.com/justinpinkney/stable-diffusion/assets/57925599/53e94f81-2a8d-4ba5-b49b-22996eda9607) And...

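```python
import time

# `pyfastllm`, `model`, and `text` are assumed to be set up earlier (not shown here).

# Time a batch containing a single prompt.
st = time.time()
prompts = [text]
config = pyfastllm.GenerationConfig()
res = model.batch_response(prompts, None, config)
one_time = time.time() - st
print(one_time)

# Time a batch containing four copies of the same prompt.
multi_st = time.time()
prompts = [text, text, text, text]
config = pyfastllm.GenerationConfig()
res = model.batch_response(prompts, None, config)
multi_time = time.time() - multi_st
print(multi_time)
```
multi_time is roughly four times one_time? Could this be caused by an unreasonable parameter configuration?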
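To make the comparison above easier to read, one option is to report time per prompt rather than total time. The sketch below is only an illustration: `fake_batch_response` is a hypothetical stand-in that processes prompts one after another, not the pyfastllm API, and `per_prompt_seconds` is a helper name introduced here.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def per_prompt_seconds(batch_fn, prompts):
    """Average wall-clock seconds spent per prompt for one batched call."""
    _, elapsed = time_call(batch_fn, prompts)
    return elapsed / len(prompts)


def fake_batch_response(prompts):
    """Hypothetical stand-in: simulates a backend that handles prompts serially."""
    time.sleep(0.01 * len(prompts))
    return ["" for _ in prompts]


single = per_prompt_seconds(fake_batch_response, ["hi"])
batched = per_prompt_seconds(fake_batch_response, ["hi"] * 4)
print(f"per prompt: single={single:.4f}s batched={batched:.4f}s")
```

If the per-prompt time is about the same for both calls, as it is with this serial stand-in, the batch is not being processed in parallel. A clearly lower per-prompt time for the larger batch would indicate that batching is effective.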

I tested yi-vl-6B with `srt_example_yi_vl.py` and got this error:
```
AttributeError: 'TokenizerManager' object has no attribute 'executor'
```

# Description
Flash Attention added support for softcap, which is used in [gemma2](https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf), in commit [8f873cc6](https://github.com/Dao-AILab/flash-attention/commit/8f873cc6acac2933d757b2ed6069518d619b341b).

Fixes # (issue)
## Type of change
- [ ] New feature (non-breaking change which...
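For readers unfamiliar with softcap, here is a minimal pure-Python sketch of tanh soft-capping applied to attention scores before the softmax, i.e. `softcap * tanh(score / softcap)`. It is an illustration only, not the Flash Attention kernel; the function names and sample values are hypothetical, and treating a non-positive cap as "disabled" is an assumption of this sketch.

```python
import math


def softcap_scores(scores, softcap):
    """Squash each score into (-softcap, softcap) with softcap * tanh(score / softcap)."""
    if softcap <= 0:
        # Assumption of this sketch: a non-positive cap means capping is disabled.
        return list(scores)
    return [softcap * math.tanh(s / softcap) for s in scores]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw = [-120.0, 0.0, 3.0, 120.0]  # made-up attention scores
capped = softcap_scores(raw, softcap=50.0)
probs = softmax(capped)
print(capped)
print(probs)
```

Small scores pass through almost unchanged, while very large ones are bounded by the cap, so no single score can dominate the softmax without limit.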

I reinstalled flash-attn with `pip install flash-attn==2.6.1` in the NGC PyTorch Docker image 24.06. When I ran a training job, I got the following error: ``` Traceback (most recent call last): File "/data1/nfs15/nfs/bigdata/zhanglei/ai-platform/hpc-test/multi-node-train/megatron-lm-train/Megatron-LM/20240411/Megatron-LM/pretrain_gpt.py", line 8,...