Han Guo issues

Results 10 issues of Han Guo

Questions for p_gens

Thanks for this awesome implementation! I have a question regarding the dimension of `p_gens`. Why is it a list (with length equal to the number of decoder steps) of scalars, but...

question

`hf-bitsandbytes-integration.md` Incorrect Dequantization

Hi, in the `bitsandbytes` [integration blog](https://github.com/huggingface/blog/blob/main/hf-bitsandbytes-integration.md), it says one could retrieve the FP16 weights via `(int8_model[0].weight.CB * int8_model[0].weight.SCB) / 127`. However, this is incorrect. In the case of...

[QST] StreamK ReductionStrategy: "Atomic" or "Mixed"

**What is your question?** Hi, I'm learning/going through the StreamK implementation in CUTLASS, and came across various reduction strategies: ```cpp /// Reduction strategy enum ReductionStrategy { kNone, // Data-parallel strategy...

question

inactive-30d

inactive-90d

`optimization/`: `num_validation_examples` batches instead of examples

Hi, it seems like the following [line](https://github.com/google-research/federated/blob/master/optimization/trainer.py#L132) takes `num_validation_examples` **_batches_** instead of examples. Is this intentional? Thanks in advance!

Why do we print just half of `trainable_params` when using 4-bits?

Hi, first of all, thanks for this amazing repo! I have a quick (and very likely dumb) question about the following line. Specifically, why do we print just half of...

[QST] `cutlass::Array` and `cute::Tensor` --- using CUTLASS utility structs/classes with CUTE (such as `NumericArrayConverter`)

**What is your question?** Hi, I'm wondering what's the proper way of using CUTLASS utility structs/classes with CUTE Tensors. A particular example I'm interested in is `NumericArrayConverter`, though that can...

question

? - Needs Triage

inactive-30d

[Trainer.train] learning rate logging inconsistency: learning rate for the future step is logged

### System Info NA ### Who can help? @muellerzr and @pacman100 ### Information - [ ] The official example scripts - [ ] My own modified scripts ### Tasks -...

trainer

[QST] GEMM Epilogue Fusion: Element-wise Ops and Two-Tensor Element-wise Multiplication

**What is your question?** Hi, I'd like to compute the following: `D = f(matmul(A, B)) * C`, where `f` is an element-wise activation function, and `C`...

question

? - Needs Triage

[FEA] BFloat16x2 Atomics

Currently, CUTLASS only implements a specialization of `atomic_add` for [`half2`](https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/functional.h#L613), but not `nv_bfloat162`. This in turn limits [BlockStripedReduce](https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/block_striped.h#L241) to specialize in `half2` but not `nv_bfloat162`. Is there any reason not...

feature request

FLUTE Integration for Fast Inference

### Feature request Hi, we are big fans of the library and the NF4 data-type, so much so that we have been working on [CUDA kernels](https://github.com/HanGuo97/flute) to speed up inference for...

enhancement