Han Guo
Thanks for this awesome implementation! I have a question regarding the dimension of `p_gens`. Why is it a list (with length equal to the number of decoder steps) of scalars, but...
Hi, in the `bitsandbytes` [integration blog](https://github.com/huggingface/blog/blob/main/hf-bitsandbytes-integration.md), it says one could retrieve the FP16 weights via ``` (int8_model[0].weight.CB * int8_model[0].weight.SCB) / 127 ``` However, this is incorrect. In the case of...
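For readers unfamiliar with the formula quoted above, here is a minimal pure-Python sketch of absmax int8 quantization and the matching dequantization. It assumes `SCB` holds one absmax scale per weight row, which the excerpt does not confirm. The helper names `quantize_rowwise` and `dequantize_rowwise` are hypothetical and are not part of `bitsandbytes`.

```python
def quantize_rowwise(weights):
    """Quantize each row to the int8 range using that row's own absmax scale."""
    cb, scb = [], []
    for row in weights:
        # Assumption: one scale per row; fall back to 1.0 for an all-zero row.
        scale = max(abs(w) for w in row) or 1.0
        scb.append(scale)
        cb.append([round(w * 127 / scale) for w in row])
    return cb, scb


def dequantize_rowwise(cb, scb):
    """Invert the quantization: each row is multiplied by its own scale."""
    return [[q * scale / 127 for q in row] for row, scale in zip(cb, scb)]


weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -4.0]]
cb, scb = quantize_rowwise(weights)
restored = dequantize_rowwise(cb, scb)
```

Under this assumption the scale is applied once per row, so any tensor version of the formula has to broadcast the scale vector along the row axis.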
**What is your question?** Hi, I'm learning/going through the StreamK implementation in CUTLASS, and came across various reduction strategies: ```cpp /// Reduction strategy enum ReductionStrategy { kNone, // Data-parallel strategy...
Hi, it seems like the following [line](https://github.com/google-research/federated/blob/master/optimization/trainer.py#L132) takes `num_validation_examples` **_batches_** instead of examples. Is this intentional? Thanks in advance!
Hi, first of all, thanks for this amazing repo! I have a quick (and very likely dumb) question about the following line. Specifically, why do we print just half of...
**What is your question?** Hi, I'm wondering what's the proper way of using CUTLASS utility structs/classes with CUTE Tensors. A particular example I'm interested in is `NumericArrayConverter`, though that can...
### System Info NA ### Who can help? @muellerzr and @pacman100 ### Information - [ ] The official example scripts - [ ] My own modified scripts ### Tasks -...
**What is your question?** Hi, I'd like to compute the following ``` D = f( matmul(A, B) ) * C ``` where `f` is an element-wise activation function, and `C`...
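As a plain reference for the expression above, the following sketch computes `D = f(matmul(A, B)) * C` in pure Python. It is not a CUTLASS epilogue. It assumes that `C` has the same shape as the product and that `*` is element-wise, neither of which the truncated excerpt confirms. ReLU is used only as a stand-in for `f`, and the name `fused_reference` is hypothetical.

```python
def relu(x):
    """Stand-in activation; the excerpt does not say which f is intended."""
    return max(0.0, x)


def fused_reference(A, B, C, f=relu):
    """Compute D[i][j] = f(sum_k A[i][k] * B[k][j]) * C[i][j]."""
    M, K, N = len(A), len(B), len(B[0])
    D = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = sum(A[i][k] * B[k][j] for k in range(K))
            # Activation is applied to the accumulator before scaling by C.
            D[i][j] = f(acc) * C[i][j]
    return D


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, -1.0]]
C = [[2.0, 2.0], [2.0, 2.0]]
D = fused_reference(A, B, C)
```

A reference like this is mainly useful for checking the output of a fused kernel on small inputs.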
Currently, CUTLASS only implements a specialization of `atomic_add` for [`half2`](https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/functional.h#L613), but not `nv_bfloat162`. This in turn limits [BlockStripedReduce](https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/block_striped.h#L241) to specialize in `half2` but not `nv_bfloat162`. Is there any reason not...
### Feature request Hi, we are big fans of the library and the NF4 data-type, so much so that we have been working on [CUDA kernels](https://github.com/HanGuo97/flute) to speed up inference for...