Ma Mingfei
Yes, the current `scatter_add` is essentially `sorting` + `segment_add`, and both parts are properly parallelized. Because of the semantics limitation, we cannot skip the sorting, since from PyTorch...
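As a rough illustration of that decomposition, here is a minimal sketch in plain PyTorch (illustrative shapes only) showing that `scatter_add` matches a sort-by-index followed by per-segment sums:

```python
import torch

# Minimal sketch: scatter_add viewed as "sort by index" + "segment add".
src = torch.randn(8)
index = torch.randint(0, 4, (8,))

# Reference result: native scatter_add into a zero-initialized output.
out_ref = torch.zeros(4).scatter_add(0, index, src)

# Two-step view: sort the indices, then sum each contiguous run of
# equal indices (the "segment add" part).
sorted_idx, perm = torch.sort(index)
sorted_src = src[perm]
counts = torch.bincount(sorted_idx, minlength=4)
offsets = torch.cat([torch.zeros(1, dtype=torch.long), counts.cumsum(0)])
out_seg = torch.stack(
    [sorted_src[offsets[i]:offsets[i + 1]].sum() for i in range(4)]
)

assert torch.allclose(out_ref, out_seg)
```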
Update on `spmm` optimizations, PR submitted at https://github.com/pytorch/pytorch/pull/83727. This ports the `spmm` reduction from `torch-sparse` to `torch`; the current PR is only for demonstrating the performance gains, and the API definition needs further amendment. Now...
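For context, a minimal sketch of the use case the port targets, assuming a `reduce=` keyword on `torch.sparse.mm` (the exact API surface the PR lands on may differ):

```python
import torch

# Hedged sketch of spmm-with-reduction (the torch-sparse spmm_{sum,mean,max,min}
# use case). The `reduce=` keyword is an assumption here; shapes/values are
# illustrative only.
crow = torch.tensor([0, 2, 3])
col = torch.tensor([0, 2, 1])
val = torch.tensor([1.0, 2.0, 3.0])
A = torch.sparse_csr_tensor(crow, col, val, size=(2, 3))
B = torch.randn(3, 4)

out_sum = torch.sparse.mm(A, B)                  # plain spmm, same as A @ B
out_mean = torch.sparse.mm(A, B, reduce="mean")  # per-row mean instead of sum
```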
#### Update on Training optimization

Optimize `torch.gather` for the classic PyG use case (the `index` tensor is broadcasted); this will be the backward of `scatter` in training: https://github.com/pytorch/pytorch/pull/87586. When the `index`...
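For reference, a small sketch of the broadcasted-index `gather` pattern that shows up in PyG-style message passing (shapes here are illustrative):

```python
import torch

# The "broadcasted index" gather pattern: index has shape (E,) and is
# expanded along the feature dimension without materializing a copy.
x = torch.randn(5, 16)                  # node features
edge_src = torch.randint(0, 5, (20,))   # source node of each edge

idx = edge_src.view(-1, 1).expand(-1, x.size(1))  # (E, 16) expanded view
msgs = torch.gather(x, 0, idx)                    # same as x[edge_src]

assert torch.equal(msgs, x[edge_src])
```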
Sort out the 2nd stage optimization a little bit:
- [x] optimization of `sampled_addmm` on SparseCSR (see the sketch below): https://github.com/pytorch/pytorch/pull/90978
- [x] enabling of `sampled_addmm` on SparseCOO (canceled)
- [ ] unify `ReduceTypes`:...
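For reference, a minimal sketch of what the `sampled_addmm` op in the first item computes: `beta * S + alpha * (A @ B)`, evaluated only at the locations present in the sparsity pattern of the CSR input. Values and shapes below are illustrative only:

```python
import torch

# sampled_addmm on a SparseCSR tensor: the dense matmul A @ B is "sampled"
# at the nonzero positions of S, then combined with beta * S.
crow = torch.tensor([0, 1, 2])
col = torch.tensor([1, 0])
val = torch.tensor([0.5, 0.5])
S = torch.sparse_csr_tensor(crow, col, val, size=(2, 2))

A = torch.randn(2, 3)
B = torch.randn(3, 2)
out = torch.sparse.sampled_addmm(S, A, B, beta=1.0, alpha=1.0)  # SparseCSR result
```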
> Hi, any update on this?

[NicolasHug](https://github.com/NicolasHug) and [vfdev-5](https://github.com/vfdev-5) have done a lot of work on optimizing int8/uint8 image scaling/resize in torch.
@CaoE https://github.com/pytorch/pytorch/pull/99539 will be landed first; please rebase again once it is settled, and add more unit test cases for both inference and training.
@mikekgfb cool! We are just about to do something similar on the CPU device. We will add native support for int4 kernels on CPU:
* [_convert_weight_to_int4pack_cuda](https://github.com/pytorch/pytorch/blob/fcf6a76108be8e7b6db528b631a7b8ebdc7470ac/aten/src/ATen/native/native_functions.yaml#L4059)
* [_weight_int4pack_mm_cuda](https://github.com/pytorch/pytorch/blob/fcf6a76108be8e7b6db528b631a7b8ebdc7470ac/aten/src/ATen/native/native_functions.yaml#L4065)

So...
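To make the int4 idea above concrete, a conceptual sketch of int4 weight packing in plain PyTorch follows. The real `_convert_weight_to_int4pack` / `_weight_int4pack_mm` kernels use a different, hardware-friendly layout plus group-wise scales and zero points, so this only illustrates the packing concept, not the actual op signatures:

```python
import torch

# Two 4-bit values are packed into one uint8 byte, then unpacked back
# for a float reference matmul.
w = torch.randint(0, 16, (4, 8), dtype=torch.uint8)   # fake 4-bit weights
packed = (w[:, 0::2] << 4) | w[:, 1::2]               # pack pairs into bytes

# Unpack: high nibble first, then low nibble, interleaved back to (4, 8).
hi = (packed >> 4) & 0xF
lo = packed & 0xF
w_unpacked = torch.stack([hi, lo], dim=-1).reshape(4, 8)
assert torch.equal(w_unpacked, w)

x = torch.randn(2, 8)
out = x @ w_unpacked.to(torch.float32).t()            # (2, 4)
```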
Cool, this is merged :) I will wrap up the code and upstream the CPU backend optimization kernels to PyTorch soon.
> Planning to use Sglang on Intel Gaudi 2, but I have not tried it yet. Would like to know the current support level?

@xinyu-intel we don't have a binding for...
@ggerganov could you please take a look at this one? I have moved the AMX init code from ggml.c to ggml-amx/mmq.cpp as suggested in the previous comments.