Qing Lan

Results 80 comments of Qing Lan

@michaelfeil I believe what you said has nothing to do with how we do Tensor Parallel communication. Ray is a good package for starting a cross-node (machine) cluster...

@WoosukKwon MPI should make that easy enough if you want; the `mpi4py` Python package allows you to send serialized objects. torch.dist can only pass tensors that...

If it is purely cross-CPU-process communication, you can also use shared-memory access control via https://docs.python.org/3/library/multiprocessing.shared_memory.html. Launch instances as [pure spin multi-processes](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/launch.py), or use MPI, or torch.dist. All should...
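The shared-memory route can be sketched with nothing but the standard library (a minimal illustration of cross-process communication, not the DeepSpeed launcher or torch.dist):

```python
from multiprocessing import Process, shared_memory


def writer(name):
    # Attach to an existing shared-memory block by name and write into it.
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:5] = b"hello"
    shm.close()


if __name__ == "__main__":
    # Create a named shared-memory block visible to child processes.
    shm = shared_memory.SharedMemory(create=True, size=5)
    p = Process(target=writer, args=(shm.name,))
    p.start()
    p.join()
    # The parent sees the bytes the child wrote, with no copies or sockets.
    print(bytes(shm.buf[:5]))  # b'hello'
    shm.close()
    shm.unlink()  # free the block once all processes are done with it
```

Unlike torch.dist, this passes raw bytes, so you would need your own serialization (e.g. `pickle`) for arbitrary objects.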

@tmichniewski Hi, we were able to run inference on GPU. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device. You...

On our test platform, we benchmarked the ResNet50 image classification model; there is a small gap between Python GPU and Java GPU (you can find my issue filed here). However,...

This is not a valid input, since max_length is smaller than the input token count. You may want to use a value larger than the input token length, or use `max_new_tokens` instead to avoid input...
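The difference comes down to what each parameter bounds: `max_length` caps the total sequence (prompt plus generated tokens), while `max_new_tokens` caps only the generated part. A minimal sketch of that arithmetic (the helper `tokens_to_generate` is hypothetical, not part of the transformers API):

```python
def tokens_to_generate(input_len, max_length=None, max_new_tokens=None):
    """Return how many new tokens a generate() call may produce.

    max_length bounds the TOTAL sequence (prompt + generated), so a value
    smaller than the prompt length leaves no budget for new tokens.
    max_new_tokens bounds only the generated part, regardless of prompt size.
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        return max(0, max_length - input_len)
    return 0


# A 512-token prompt with max_length=128 leaves no room to generate anything.
print(tokens_to_generate(512, max_length=128))      # 0
# max_new_tokens=128 always permits 128 new tokens, whatever the prompt size.
print(tokens_to_generate(512, max_new_tokens=128))  # 128
```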

@RezaYazdaniAminabadi This solution will not work on Falcon 7B since the modelling file is different. I think this is a bug HuggingFace needs to solve, but just FYI. Maybe some...

> I actually have a question from you guys, has anyone tested the inference of this model on [text_generation_inference](https://github.com/huggingface/text-generation-inference) system from HuggingFace?

Yes. What information do you need?

@RezaYazdaniAminabadi So for the Falcon kernel you created (06/20): it is faster than the TextGeneration flash implementation for sequence lengths < 256, but the kernel crashes on longer sequence lengths. We cannot do...

@RezaYazdaniAminabadi could we merge this?