Sunghwan Shim comments

Results: 5 comments by Sunghwan Shim

[BUG] Running DeepSpeed with MoE inference leads to CUDA illegal memory access and NaN activation

The same kind of problem occurs when I run [`generate_text.sh` here](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/generate_text.sh). I've also tried the same thing with the `nvcr.io/nvidia/pytorch:20.12-py3` Docker image, but the same error occurred. **error log** ``` . . ....

[BUG] Failed to inference Megatron gpt-3 MoE model with `deepspeed.init_inference`

I managed to create an `InferenceEngine` by adding some configs, but another problem occurs when running its forward pass. The following is the revised `pretrain_gpt.py`: ```python from megatron.training import initialize_megatron, get_model,...

[BUG] Failed to inference Megatron gpt-3 MoE model with `deepspeed.init_inference`

@awan-10 Thanks for the comment. Unfortunately, I had already tried the example you shared and found that it didn't work (https://github.com/microsoft/DeepSpeed/issues/2030#issuecomment-1193909540).

[BUG] Failed to inference Megatron gpt-3 MoE model with `deepspeed.init_inference`

I've also tried this on a machine with 8 × V100 32 GB GPUs, but it failed with almost the same error. Does the script only run on A100?

Change RISC-V system register access to use an already-implemented crate

That library depends on std. I couldn't find another good library, so I'll put this on hold for now.