hxdtest

Results: 20 issues from hxdtest

There are some arithmetic errors with the current implementation. The likely reason is that flash attention returns bf16 values for each block, so we cannot accumulate the...
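For context, a minimal PyTorch sketch (not from the issue; block count and values are made up) illustrating why keeping a bf16 running sum over per-block results drifts away from an fp32 accumulator:

```python
import torch

torch.manual_seed(0)

# Hypothetical per-block partial results, like the per-tile outputs described above.
num_blocks, block_len = 64, 128
blocks = torch.randn(num_blocks, block_len, dtype=torch.float32)

# Reference: accumulate every block in fp32.
ref = blocks.sum(dim=0)

# Lossy variant: round each block to bf16 and keep a bf16 running sum.
acc = torch.zeros(block_len, dtype=torch.bfloat16)
for b in blocks:
    acc = acc + b.to(torch.bfloat16)   # rounding error compounds block by block

print("max abs error vs fp32 accumulation:",
      (acc.to(torch.float32) - ref).abs().max().item())
```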

What is the minimum supported flash-attention version?

**What is your question?** In `python/cutlass/emit/pytorch.py`, bfloat16 is not supported?

```
_CUTLASS_TYPE_TO_TORCH_TYPE = {
    DataType.f16: "torch::kF16",
    DataType.f32: "torch::kF32",
    DataType.f64: "torch::kF64",
    DataType.s8: "torch::I8",
    DataType.s32: "torch::I32",
}
```

feature request
help wanted
good first issue
question
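A possible direction, sketched as an assumption rather than the project's confirmed fix: if the matching libtorch scalar-type constant is `torch::kBFloat16` and the Python enum member is `DataType.bf16`, the mapping could be extended with one more entry.

```python
from cutlass import DataType  # assumed import; mirrors the dict quoted above

# Hypothetical extension of the mapping in python/cutlass/emit/pytorch.py.
# "torch::kBFloat16" is assumed to be the corresponding libtorch constant.
_CUTLASS_TYPE_TO_TORCH_TYPE = {
    DataType.f16: "torch::kF16",
    DataType.f32: "torch::kF32",
    DataType.f64: "torch::kF64",
    DataType.s8: "torch::I8",
    DataType.s32: "torch::I32",
    DataType.bf16: "torch::kBFloat16",  # proposed bfloat16 entry (not upstream)
}
```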

**What is your question?** When I run `examples/35_gemm_softmax`, I compile the file with `nvcc --expt-relaxed-constexpr -I /mnt5/xuantai.hxd/cutlass/include -I /mnt5/xuantai.hxd/run_cutlass/cutlass_gemm_softmax -I /mnt5/xuantai.hxd/cutlass/tools/util/include gemm_softmax.cu -o run -std=c++17`. But the outputs...

question
? - Needs Triage

**What is your question?** It seems that adding a tile_description makes the GEMM result different? `assert (tensor_D_numpy - tensor_D).max() == 0.0` would pass if I add a tile_description.

```
import numpy...
```

question
? - Needs Triage
inactive-30d
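Not an answer from the thread, but a common way to check such results: a different tile_description can change the floating-point accumulation order, so bit-exact equality with the NumPy reference is not guaranteed, and a tolerance-based comparison is usually more appropriate. A minimal sketch, reusing the array names from the snippet above:

```python
import numpy as np

def check_close(tensor_D, tensor_D_numpy, rtol=1e-3, atol=1e-5):
    """Compare the device result against the NumPy reference with a tolerance
    instead of demanding an exactly zero difference."""
    diff = np.abs(tensor_D_numpy.astype(np.float64) - tensor_D.astype(np.float64))
    print("max abs diff:", diff.max())
    assert np.allclose(tensor_D, tensor_D_numpy, rtol=rtol, atol=atol)
```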

**Describe the bug** I followed [02_pytorch_extension_grouped_gemm.ipynb](https://github.com/NVIDIA/cutlass/blob/main/examples/python/02_pytorch_extension_grouped_gemm.ipynb) and changed the dtype from torch.float16 to torch.bfloat16:

```
import cutlass
import torch

dtype = torch.bfloat16
plan = cutlass.op.GroupedGemm(element=dtype, layout=cutlass.LayoutType.RowMajor)
op = plan.construct()
grouped_gemm...
```

bug
? - Needs Triage
inactive-30d
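While the bfloat16 path is being sorted out, a plain PyTorch reference (independent of the CUTLASS extension, with made-up problem sizes) shows what the grouped GEMM should compute per group and can serve as a correctness baseline:

```python
import torch

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative grouped-GEMM problem sizes: one (M, N, K) per group.
problem_sizes = [(128, 256, 64), (64, 128, 32), (256, 512, 128)]

As = [torch.randn(m, k, dtype=dtype, device=device) for m, _, k in problem_sizes]
Bs = [torch.randn(k, n, dtype=dtype, device=device) for _, n, k in problem_sizes]

# Reference grouped GEMM: an independent matmul per group.
Ds_ref = [a @ b for a, b in zip(As, Bs)]
print([tuple(d.shape) for d in Ds_ref])
```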

https://github.com/volcengine/verl/blob/main/patches/megatron_v4.patch For example:

- case 1

```
- tensor_shape = [seq_length, micro_batch_size, config.hidden_size]
+ tensor_shape = [seq_length, micro_batch_size, hidden_size]
```

What is the difference between hidden_size and config.hidden_size?

- case...

question
megatron

**Your question**

```
$ ./scripts/launch.sh test/test_gemm_rs.py 4096 12288 49152 --dtype=bfloat16 --iters=10
torchrun --node_rank=0 --nproc_per_node=4 --nnodes=1 --rdzv_endpoint=127.0.0.1:23456 test/test_gemm_rs.py 4096 12288 49152 --dtype=bfloat16 --iters=10
W0821...
```

question

I am kind of confused about how rAccScore is transformed into scores. What do the following operations mean?

```
// ((2, 2),(MMA_M, MMA_N)) -> ((2,MMA_M),(2,MMA_N))
auto sl = logical_divide(rAccScore.layout(), Shape{});
auto rAccScore_new_layout...
```
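One way to see what that regrouping does, sketched with NumPy rather than CuTe (which inner 2 goes to rows versus columns depends on the MMA atom, so treat the axis order below as illustrative): the ((2, 2), (MMA_M, MMA_N)) accumulator is rearranged so one inner 2 pairs with MMA_M as row modes and the other pairs with MMA_N as column modes, giving a (2*MMA_M) x (2*MMA_N) view of the scores.

```python
import numpy as np

MMA_M, MMA_N = 3, 4  # illustrative tile counts

# Accumulator fragment viewed as ((2, 2), (MMA_M, MMA_N)).
acc = np.arange(2 * 2 * MMA_M * MMA_N).reshape(2, 2, MMA_M, MMA_N)

# Regroup to ((2, MMA_M), (2, MMA_N)): interleave one inner-2 mode with MMA_M
# and the other with MMA_N, then flatten each pair into a single row/col axis.
scores = acc.transpose(0, 2, 1, 3).reshape(2 * MMA_M, 2 * MMA_N)

print(scores.shape)  # (2*MMA_M, 2*MMA_N)
```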

In [67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling](https://github.com/NVIDIA/cutlass/blob/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu), if the params are changed like this

```
constexpr int ScaleGranularityM = 128;
constexpr int ScaleGranularityN = 128;
constexpr int ScaleGranularityK = 1;
```

does it support fp8...

question
? - Needs Triage
inactive-30d