hxdtest
Precision issue
There are some arithmetic errors with the current implementation. The likely reason is that flash attention returns a bf16 value for each block, so we cannot accumulate the...
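A minimal sketch of the accumulation point above (hypothetical shapes, plain PyTorch): rounding each block's partial result to bf16 and keeping the running sum in bf16 drifts away from an fp32 accumulation of the same partials.

```
import torch

torch.manual_seed(0)
num_blocks, rows = 64, 128
# Arbitrary stand-ins for per-block partial results produced by the attention kernel.
block_partials = torch.randn(num_blocks, rows, dtype=torch.float32)

# Reference: accumulate every block partial in fp32.
ref = block_partials.sum(dim=0)

# Simulated behaviour: each block's output is rounded to bf16 before being added,
# and the running sum itself is kept in bf16.
acc = torch.zeros(rows, dtype=torch.bfloat16)
for blk in block_partials:
    acc = acc + blk.to(torch.bfloat16)

# The gap is noticeably larger than the rounding error of a single bf16 addition.
print((acc.float() - ref).abs().max())
```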
What is the minimum supported flash-attention version?
**What is your question?** In `python/cutlass/emit/pytorch.py`, why is bfloat16 not supported?

```
_CUTLASS_TYPE_TO_TORCH_TYPE = {
    DataType.f16: "torch::kF16",
    DataType.f32: "torch::kF32",
    DataType.f64: "torch::kF64",
    DataType.s8: "torch::I8",
    DataType.s32: "torch::I32",
}
```
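One possible workaround, assuming the installed package still ships the table above without a bf16 entry, is to patch the mapping before emitting the PyTorch extension; `torch::kBFloat16` is libtorch's bfloat16 scalar-type constant. This is only a sketch, not the library's official fix.

```
# Assumptions: the module path cutlass.emit.pytorch matches the file quoted above,
# and DataType.bf16 exists in the Python DataType enum.
from cutlass import DataType
from cutlass.emit import pytorch as cutlass_pytorch_emit

# Map CUTLASS bf16 onto libtorch's bfloat16 scalar type before calling the emitter.
cutlass_pytorch_emit._CUTLASS_TYPE_TO_TORCH_TYPE[DataType.bf16] = "torch::kBFloat16"
```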
**What is your question?** When I run `examples/35_gemm_softmax`, I compile the file with

```
nvcc --expt-relaxed-constexpr -I /mnt5/xuantai.hxd/cutlass/include -I /mnt5/xuantai.hxd/run_cutlass/cutlass_gemm_softmax -I /mnt5/xuantai.hxd/cutlass/tools/util/include gemm_softmax.cu -o run -std=c++17
```

But the outputs...
**What is your question?** It seems that adding a tile_description makes the GEMM result different? `assert (tensor_D_numpy - tensor_D).max() == 0.0` would pass if I add a tile_description.

```
import numpy...
```
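A tolerance-based comparison may be more appropriate here than exact equality: a different tile description changes the order in which partial products are accumulated, so bit-exact agreement between two valid GEMM results is not guaranteed. A small sketch, reusing the array names from the snippet above:

```
import numpy as np

def check_gemm_close(tensor_D, tensor_D_numpy, rtol=1e-3, atol=1e-3):
    # Floating-point accumulation order differs between configurations, so compare
    # with a tolerance instead of asserting an exact zero difference.
    assert np.allclose(tensor_D, tensor_D_numpy, rtol=rtol, atol=atol), \
        "GEMM result differs from the NumPy reference beyond tolerance"
```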
**Describe the bug** I followed [02_pytorch_extension_grouped_gemm.ipynb](https://github.com/NVIDIA/cutlass/blob/main/examples/python/02_pytorch_extension_grouped_gemm.ipynb) and changed the dtype from torch.float16 to torch.bfloat16:

```
import cutlass
import torch

dtype = torch.bfloat16
plan = cutlass.op.GroupedGemm(element=dtype, layout=cutlass.LayoutType.RowMajor)
op = plan.construct()
grouped_gemm...
```
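To isolate whether the failure is in the bf16 path itself or in the emitted PyTorch extension, a plain-PyTorch reference for the grouped problem can help. The problem sizes below are arbitrary stand-ins, and the commented comparison assumes the CUTLASS outputs are available as a list (`Ds_cutlass` is a hypothetical name).

```
import torch

dtype = torch.bfloat16
device = "cuda"

# Arbitrary (M, N, K) problem sizes standing in for the grouped-GEMM problem list.
problem_sizes = [(128, 256, 64), (64, 128, 32), (256, 512, 128)]

As = [torch.randn(m, k, dtype=dtype, device=device) for m, n, k in problem_sizes]
Bs = [torch.randn(k, n, dtype=dtype, device=device) for m, n, k in problem_sizes]

# Per-group reference results computed by PyTorch itself.
Ds_ref = [a @ b for a, b in zip(As, Bs)]

# If Ds_cutlass holds the grouped-GEMM outputs, compare group by group with a
# bf16-sized tolerance:
# for d, d_ref in zip(Ds_cutlass, Ds_ref):
#     torch.testing.assert_close(d, d_ref, rtol=1.6e-2, atol=1e-2)
```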
https://github.com/volcengine/verl/blob/main/patches/megatron_v4.patch

For example:

- case 1

  ```
  - tensor_shape = [seq_length, micro_batch_size, config.hidden_size]
  + tensor_shape = [seq_length, micro_batch_size, hidden_size]
  ```

  What is the difference between hidden_size and config.hidden_size?

- case...
**Your question**

```
$./scripts/launch.sh test/test_gemm_rs.py 4096 12288 49152 --dtype=bfloat16 --iters=10
torchrun --node_rank=0 --nproc_per_node=4 --nnodes=1 --rdzv_endpoint=127.0.0.1:23456 test/test_gemm_rs.py 4096 12288 49152 --dtype=bfloat16 --iters=10
W0821...
```
I am a bit confused about transforming rAccScore to scores. What do the following operations mean?

```
// ((2, 2),(MMA_M, MMA_N)) -> ((2,MMA_M),(2,MMA_N))
auto sl = logical_divide(rAccScore.layout(), Shape{});
auto rAccScore_new_layout...
```
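A conceptual NumPy sketch (not CuTe) of the regrouping the comment describes: an accumulator fragment viewed as ((2, 2), (MMA_M, MMA_N)) is regrouped into ((2, MMA_M), (2, MMA_N)), i.e. one inner "2" joins the M modes (rows) and the other joins the N modes (columns). Which inner mode pairs with rows versus columns is an assumption here; the actual choice is encoded by the CuTe layout algebra in the kernel.

```
import numpy as np

MMA_M, MMA_N = 3, 4  # arbitrary tile counts for illustration
acc = np.arange(2 * 2 * MMA_M * MMA_N).reshape(2, 2, MMA_M, MMA_N)

# Bring one pair mode next to MMA_M and the other next to MMA_N, then flatten each
# pair: (2, 2, MMA_M, MMA_N) -> (2, MMA_M, 2, MMA_N) -> (2*MMA_M, 2*MMA_N).
scores = acc.transpose(0, 2, 1, 3).reshape(2 * MMA_M, 2 * MMA_N)
print(scores.shape)  # (6, 8): per-thread values arranged as (rows, columns)
```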
In [67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling](https://github.com/NVIDIA/cutlass/blob/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu), if the params are changed like this:

```
constexpr int ScaleGranularityM = 128;
constexpr int ScaleGranularityN = 128;
constexpr int ScaleGranularityK = 1;
```

does it support fp8...