hxdtest
Precision issue
There are some arithmetic errors with the current implementation. The likely reason is that flash attention returns a bf16 value for each block, so we cannot accumulate the...
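A minimal sketch of the accumulation point above (hypothetical shapes, plain PyTorch): rounding each block's partial result to bf16 and keeping the running sum in bf16 drifts away from an fp32 accumulation of the same partials.

```
import torch

torch.manual_seed(0)
num_blocks, rows = 64, 128
# Arbitrary stand-ins for per-block partial results produced by the attention kernel.
block_partials = torch.randn(num_blocks, rows, dtype=torch.float32)

# Reference: accumulate every block partial in fp32.
ref = block_partials.sum(dim=0)

# Simulated behaviour: each block's output is rounded to bf16 before being added,
# and the running sum itself is kept in bf16.
acc = torch.zeros(rows, dtype=torch.bfloat16)
for blk in block_partials:
    acc = acc + blk.to(torch.bfloat16)

# The gap is noticeably larger than the rounding error of a single bf16 addition.
print((acc.float() - ref).abs().max())
```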
What is the minimum supported flash-attention version?
**What is your question?** In `python/cutlass/emit/pytorch.py`, why is bfloat16 not supported?

```
_CUTLASS_TYPE_TO_TORCH_TYPE = {
    DataType.f16: "torch::kF16",
    DataType.f32: "torch::kF32",
    DataType.f64: "torch::kF64",
    DataType.s8: "torch::I8",
    DataType.s32: "torch::I32",
}
```
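One possible workaround, assuming the installed package still ships the table above without a bf16 entry, is to patch the mapping before emitting the PyTorch extension; `torch::kBFloat16` is libtorch's bfloat16 scalar-type constant. This is only a sketch, not the library's official fix.

```
# Assumptions: the module path cutlass.emit.pytorch matches the file quoted above,
# and DataType.bf16 exists in the Python DataType enum.
from cutlass import DataType
from cutlass.emit import pytorch as cutlass_pytorch_emit

# Map CUTLASS bf16 onto libtorch's bfloat16 scalar type before calling the emitter.
cutlass_pytorch_emit._CUTLASS_TYPE_TO_TORCH_TYPE[DataType.bf16] = "torch::kBFloat16"
```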
**What is your question?** When I run `examples/35_gemm_softmax`, I compile the file with

```
nvcc --expt-relaxed-constexpr -I /mnt5/xuantai.hxd/cutlass/include -I /mnt5/xuantai.hxd/run_cutlass/cutlass_gemm_softmax -I /mnt5/xuantai.hxd/cutlass/tools/util/include gemm_softmax.cu -o run -std=c++17
```

But the outputs...
**What is your question?** It seems that adding a tile_description makes the GEMM result different? `assert (tensor_D_numpy - tensor_D).max() == 0.0` would pass if I add a tile_description.

```
import numpy...
```
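A tolerance-based comparison may be more appropriate here than exact equality: a different tile description changes the order in which partial products are accumulated, so bit-exact agreement between two valid GEMM results is not guaranteed. A small sketch, reusing the array names from the snippet above:

```
import numpy as np

def check_gemm_close(tensor_D, tensor_D_numpy, rtol=1e-3, atol=1e-3):
    # Floating-point accumulation order differs between configurations, so compare
    # with a tolerance instead of asserting an exact zero difference.
    assert np.allclose(tensor_D, tensor_D_numpy, rtol=rtol, atol=atol), \
        "GEMM result differs from the NumPy reference beyond tolerance"
```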
**Describe the bug** I followed [02_pytorch_extension_grouped_gemm.ipynb](https://github.com/NVIDIA/cutlass/blob/main/examples/python/02_pytorch_extension_grouped_gemm.ipynb) and changed the dtype from torch.float16 to torch.bfloat16:

```
import cutlass
import torch

dtype = torch.bfloat16
plan = cutlass.op.GroupedGemm(element=dtype, layout=cutlass.LayoutType.RowMajor)
op = plan.construct()
grouped_gemm...
```
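To isolate whether the failure is in the bf16 path itself or in the emitted PyTorch extension, a plain-PyTorch reference for the grouped problem can help. The problem sizes below are arbitrary stand-ins, and the commented comparison assumes the CUTLASS outputs are available as a list (`Ds_cutlass` is a hypothetical name).

```
import torch

dtype = torch.bfloat16
device = "cuda"

# Arbitrary (M, N, K) problem sizes standing in for the grouped-GEMM problem list.
problem_sizes = [(128, 256, 64), (64, 128, 32), (256, 512, 128)]

As = [torch.randn(m, k, dtype=dtype, device=device) for m, n, k in problem_sizes]
Bs = [torch.randn(k, n, dtype=dtype, device=device) for m, n, k in problem_sizes]

# Per-group reference results computed by PyTorch itself.
Ds_ref = [a @ b for a, b in zip(As, Bs)]

# If Ds_cutlass holds the grouped-GEMM outputs, compare group by group with a
# bf16-sized tolerance:
# for d, d_ref in zip(Ds_cutlass, Ds_ref):
#     torch.testing.assert_close(d, d_ref, rtol=1.6e-2, atol=1e-2)
```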
https://github.com/volcengine/verl/blob/main/patches/megatron_v4.patch

For example:

- case 1

  ```
  - tensor_shape = [seq_length, micro_batch_size, config.hidden_size]
  + tensor_shape = [seq_length, micro_batch_size, hidden_size]
  ```

  What is the difference between hidden_size and config.hidden_size?

- case...
**Your question**

```
$./scripts/launch.sh test/test_gemm_rs.py 4096 12288 49152 --dtype=bfloat16 --iters=10
torchrun --node_rank=0 --nproc_per_node=4 --nnodes=1 --rdzv_endpoint=127.0.0.1:23456 test/test_gemm_rs.py 4096 12288 49152 --dtype=bfloat16 --iters=10
W0821...
```
I am a bit confused about transforming rAccScore to scores. What do the following operations mean?

```
// ((2, 2),(MMA_M, MMA_N)) -> ((2,MMA_M),(2,MMA_N))
auto sl = logical_divide(rAccScore.layout(), Shape{});
auto rAccScore_new_layout...
```
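A conceptual NumPy sketch (not CuTe) of the regrouping the comment describes: an accumulator fragment viewed as ((2, 2), (MMA_M, MMA_N)) is regrouped into ((2, MMA_M), (2, MMA_N)), i.e. one inner "2" joins the M modes (rows) and the other joins the N modes (columns). Which inner mode pairs with rows versus columns is an assumption here; the actual choice is encoded by the CuTe layout algebra in the kernel.

```
import numpy as np

MMA_M, MMA_N = 3, 4  # arbitrary tile counts for illustration
acc = np.arange(2 * 2 * MMA_M * MMA_N).reshape(2, 2, MMA_M, MMA_N)

# Bring one pair mode next to MMA_M and the other next to MMA_N, then flatten each
# pair: (2, 2, MMA_M, MMA_N) -> (2, MMA_M, 2, MMA_N) -> (2*MMA_M, 2*MMA_N).
scores = acc.transpose(0, 2, 1, 3).reshape(2 * MMA_M, 2 * MMA_N)
print(scores.shape)  # (6, 8): per-thread values arranged as (rows, columns)
```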
In [67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling](https://github.com/NVIDIA/cutlass/blob/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu), if the params are changed like this:

```
constexpr int ScaleGranularityM = 128;
constexpr int ScaleGranularityN = 128;
constexpr int ScaleGranularityK = 1;
```

does it support fp8...