Qing Lan

Results 80 comments of Qing Lan

@michaelfeil I believe what you said has nothing to do with how we do Tensor Parallel communication. Ray is a good package for starting a cross-node (machine) cluster...

@WoosukKwon MPI should make that easy enough if you want; the `mpi4py` Python package allows you to send serialized objects. torch.dist can only pass tensors that...

If it is purely cross-CPU-process communication, you can also use shared-memory access control via https://docs.python.org/3/library/multiprocessing.shared_memory.html. Launch instances as [pure spin multi-processes](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/launch.py), or use MPI, or torch.dist. All should...
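The shared-memory route can be sketched with nothing but the standard library (a minimal illustration of cross-process communication, not the DeepSpeed launcher or torch.dist):

```python
from multiprocessing import Process, shared_memory


def writer(name):
    # Attach to an existing shared-memory block by name and write into it.
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:5] = b"hello"
    shm.close()


if __name__ == "__main__":
    # Create a named shared-memory block visible to child processes.
    shm = shared_memory.SharedMemory(create=True, size=5)
    p = Process(target=writer, args=(shm.name,))
    p.start()
    p.join()
    # The parent sees the bytes the child wrote, with no copies or sockets.
    print(bytes(shm.buf[:5]))  # b'hello'
    shm.close()
    shm.unlink()  # free the block once all processes are done with it
```

Unlike torch.dist, this passes raw bytes, so you would need your own serialization (e.g. `pickle`) for arbitrary objects.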

@tmichniewski Hi, we were able to run inference on GPU. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device. You...

On our test platform, we benchmarked the ResNet50 image classification model; there is a small gap between Python GPU and Java GPU (you can find my issue filed here). However,...

This is not a valid input, since max_length is smaller than the input token count. You may want to use a value larger than the input token length, or use `max_new_tokens` instead to avoid input...
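The difference comes down to what each parameter bounds: `max_length` caps the total sequence (prompt plus generated tokens), while `max_new_tokens` caps only the generated part. A minimal sketch of that arithmetic (the helper `tokens_to_generate` is hypothetical, not part of the transformers API):

```python
def tokens_to_generate(input_len, max_length=None, max_new_tokens=None):
    """Return how many new tokens a generate() call may produce.

    max_length bounds the TOTAL sequence (prompt + generated), so a value
    smaller than the prompt length leaves no budget for new tokens.
    max_new_tokens bounds only the generated part, regardless of prompt size.
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        return max(0, max_length - input_len)
    return 0


# A 512-token prompt with max_length=128 leaves no room to generate anything.
print(tokens_to_generate(512, max_length=128))      # 0
# max_new_tokens=128 always permits 128 new tokens, whatever the prompt size.
print(tokens_to_generate(512, max_new_tokens=128))  # 128
```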

@RezaYazdaniAminabadi This solution will not work on Falcon 7B since the modelling file is different. I think this is a bug HuggingFace needs to solve, but just FYI. Maybe some...

> I actually have a question from you guys, has anyone tested the inference of this model on [text_generation_inference](https://github.com/huggingface/text-generation-inference) system from HuggingFace?

Yes. What information do you need?

@RezaYazdaniAminabadi So for the Falcon kernel you created (06/20): it is faster than the TextGeneration flash implementation for sequence lengths < 256, but the kernel crashes on longer sequence lengths. We cannot do...

@RezaYazdaniAminabadi could we merge this?