Qing Lan comments



Qing Lan

[BUG][0.6.7] garbage output for multi-gpu with tutorial

This is also reproducible with the GPT-J-6B model if you simply switch to it.

[BUG] Cannot run DeepSpeed with transformers on NVIDIA Tesla T4 GPU

@jeffra Here is the pip list:
```
root@b963730b133d:/# pip3 list
Package            Version
------------------ ------------
certifi            2022.5.18.1
charset-normalizer 2.0.12
deepspeed          0.6.5
filelock           3.7.1
hjson              3.0.2
huggingface-hub    0.7.0
idna               3.3
ninja              1.10.2.3
...
```
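When reporting environment details like the pip list above, the versions of just the relevant packages can also be collected from Python. This is a minimal sketch and is not part of the original comment. It uses the standard-library `importlib.metadata` module (Python 3.8+). The helper name `package_versions` is hypothetical, and the package names passed to it are a subset of the truncated list above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def package_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if not installed.

    Hypothetical helper for bug reports; not part of DeepSpeed or transformers.
    """
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            # metadata.version looks up the installed distribution by name.
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Subset of the packages from the (truncated) pip list above.
    wanted = ["deepspeed", "huggingface-hub", "filelock", "ninja"]
    for name, version in package_versions(wanted).items():
        print(f"{name:<20} {version or 'not installed'}")
```

Unlike a full `pip3 list`, this prints only the packages you name. That keeps a report short but can omit a dependency that turns out to matter.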

[BUG] Cannot run DeepSpeed with transformers on NVIDIA Tesla T4 GPU

@jeffra I just tested on an NVIDIA V100 machine with 4 GPUs onboard, and the code works fine, which matches what you have tested. So I think we can just narrow down...

[BUG] Cannot run DeepSpeed with transformers on NVIDIA Tesla T4 GPU

I just raised a separate issue here: https://github.com/microsoft/DeepSpeed/issues/2113. It is still reproducible on 0.6.7. @jayargo did you manage to fix it by any chance?

Mac M1 GPU support via Pytorch and MPS

Currently we already use the official torch 1.12.1 macOS arm64 wheel to build the JNI, so it has the capability to extend to the GPU, but we haven't implemented MPS...

[BUG] Cannot run DeepSpeed with transformers on NVIDIA Tesla T4 GPU

@jeffra I just tested again with a single GPU, and the error is gone. I also tested multi-GPU and am facing some NCCL issues. I think this might be because the machine is not...

[RFC] Graduate MXNet from Apache Incubator

Since we are moving towards graduation, I would also like to check with everyone whether we want to promote all of our committers to PMC as we graduate....

Mask-Rcnn transfer learning example

Yes, there should be a way to achieve that. Which DL framework are you looking for? Currently we support MXNet transfer learning and experimental PyTorch transfer learning. We have MaskRCNN...

Mask-Rcnn transfer learning example

So here are the steps. 1) Get the MaskRCNN pretrained model from http://docs.djl.ai/mxnet/mxnet-model-zoo/index.html; you can choose one of the backbones. Try to get it running with our instance segmentation example http://docs.djl.ai/mxnet/mxnet-model-zoo/index.html....

org.tensorflow.exceptions.TFInvalidArgumentException

Do you have any code to share to reproduce your issues?