[QST] BatchNorm with cutlass
I want to implement a BN layer as an epilogue with CUTLASS, which requires both division and addition operations. Is there a way to implement something like this? Thank you!
https://github.com/NVIDIA/cutlass/blob/master/test/unit/conv/device/conv2d_fprop_with_reduction_sm75.cu is an example of calculating the sum needed by BN. It involves two kernels: the first calculates the partial sum of each threadblock, and the second calculates the final sum.
You need to extend it to calculate the square sum together with the sum in the same way.
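For reference, once the two kernels have produced the per-channel sum and square sum, the batch statistics follow directly. A minimal host-side sketch (the function and variable names below are illustrative, not from the CUTLASS example):

#include <cmath>
#include <vector>

// Derive per-channel BN statistics from the reduction outputs.
// count is the number of reduced elements per output channel (N * P * Q).
void compute_bn_stats(std::vector<float> const &sum,     // per-channel sum
                      std::vector<float> const &sq_sum,  // per-channel sum of squares
                      int count, float eps,
                      std::vector<float> &mean,
                      std::vector<float> &inv_std) {
  for (size_t c = 0; c < sum.size(); ++c) {
    mean[c]    = sum[c] / count;
    float var  = sq_sum[c] / count - mean[c] * mean[c];  // E[x^2] - E[x]^2
    inv_std[c] = 1.0f / std::sqrt(var + eps);
  }
}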
Thank you for the answer! My motivation is to use BN in the inference stage, so can I just pass the trained mean and variance in tensor C and tensor D to the epilogue?
The standard conv we do is D = alpha x conv(A, B) + beta x C
Well, so does that mean I should put the variance in tensor E?
In inference, we don't need batch norm, so I am not really sure what you need. If your variance is a scalar, you can set it as alpha. If it is a per-channel vector, you can take a look at this example: https://github.com/NVIDIA/cutlass/blob/master/test/unit/conv/device/conv2d_fprop_with_broadcast_sm75.cu#L113. You can customize your elementwise computation here: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/epilogue/thread/linear_combination_residual_block.h#L150. bias is a per-channel vector, and residual can also be a per-channel vector if you use the stride=0 trick as in https://github.com/NVIDIA/cutlass/blob/master/examples/17_fprop_per_channel_bias/fprop_per_channel_bias.cu.
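As context for the per-channel route: at inference time, BN with frozen statistics, y = gamma * (x - mean) / sqrt(var + eps) + beta, collapses into a per-channel scale and shift, so the whole layer can be written as scale_c * conv(A, B)_c + bias_c. A hedged host-side sketch of computing those folded vectors (names are illustrative):

#include <cmath>
#include <vector>

// Fold trained BN parameters into a per-channel scale and bias usable in
// the conv epilogue:  y_c = scale[c] * conv(A, B)_c + bias[c].
void fold_bn_for_inference(std::vector<float> const &gamma,
                           std::vector<float> const &beta,
                           std::vector<float> const &running_mean,
                           std::vector<float> const &running_var,
                           float eps,
                           std::vector<float> &scale,
                           std::vector<float> &bias) {
  for (size_t c = 0; c < gamma.size(); ++c) {
    float inv_std = 1.0f / std::sqrt(running_var[c] + eps);
    scale[c] = gamma[c] * inv_std;
    bias[c]  = beta[c] - running_mean[c] * gamma[c] * inv_std;
  }
}

If the scale is multiplied into the filter weights beforehand, only the per-channel bias remains, which is exactly the stride=0 broadcast pattern from example 17.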
Thank you, I'll take a look.
@hwu36 Hello, I followed your instructions and made a new epilogue, but I cannot get through the compilation step. Here is the error log.
error: function "cutlass::gemm::warp::MmaTensorOp<Shape_, ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, Policy_, PartitionsK_, AccumulatorsInRowMajor, Enable>::operator() [with Shape_=cutlass::gemm::GemmShape<64, 64, 32>, ElementA_=float, LayoutA_=cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<32, 32>, ElementB_=float, LayoutB_=cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<32, 32>, ElementC_=float, LayoutC_=cutlass::layout::RowMajor, Policy_=cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 8>, 32, cutlass::tfloat32_t, cutlass::layout::RowMajor, cutlass::tfloat32_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, PartitionsK_=1, AccumulatorsInRowMajor=false, Enable=__nv_bool]" cannot be called with the given argument list
argument types are: (cutlass::Array<float, 128, true>, cutlass::Array<float, 16, true>, cutlass::Array<float, 16, true>, cutlass::Array<float, 128, true>)
object type is: cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, float, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 8>, 32, cutlass::tfloat32_t, cutlass::layout::RowMajor, cutlass::tfloat32_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, __nv_bool>
detected during:
instantiation of "void cutlass::conv::threadblock::ImplicitGemmPipelined<Shape_, IteratorA_, SmemIteratorA_, IteratorB_, SmemIteratorB_, ElementC_, LayoutC_, Policy_, TransformA_, TransformB_, Enable>::operator()(int, cutlass::conv::threadblock::ImplicitGemmPipelined<Shape_, IteratorA_, SmemIteratorA_, IteratorB_, SmemIteratorB_, ElementC_, LayoutC_, Policy_, TransformA_, TransformB_, Enable>::FragmentC &, cutlass::conv::threadblock::ImplicitGemmPipelined<Shape_, IteratorA_, SmemIteratorA_, IteratorB_, SmemIteratorB_, ElementC_, LayoutC_, Policy_, TransformA_, TransformB_, Enable>::IteratorA, cutlass::conv::threadblock::ImplicitGemmPipelined<Shape_, IteratorA_, SmemIteratorA_, IteratorB_, SmemIteratorB_, ElementC_, LayoutC_, Policy_, TransformA_, TransformB_, Enable>::IteratorB, const cutlass::conv::threadblock::ImplicitGemmPipelined<Shape_, IteratorA_, SmemIteratorA_, IteratorB_, SmemIteratorB_, ElementC_, LayoutC_, Policy_, TransformA_, TransformB_, Enable>::FragmentC &, cutlass::conv::threadblock::ImplicitGemmPipelined<Shape_, IteratorA_, SmemIteratorA_, IteratorB_, SmemIteratorB_, ElementC_, LayoutC_, Policy_, TransformA_, TransformB_, Enable>::TransformA, cutlass::conv::threadblock::ImplicitGemmPipelined<Shape_, IteratorA_, SmemIteratorA_, IteratorB_, SmemIteratorB_, ElementC_, LayoutC_, Policy_, TransformA_, TransformB_, Enable>::TransformB) [with Shape_=cutlass::gemm::GemmShape<128, 128, 32>, IteratorA_=cutlass::conv::threadblock::TileIterator<cutlass::conv::threadblock::Conv2dFpropActivationTileAccessIteratorAnalytic<cutlass::MatrixShape<128, 32>, ElementA, cutlass::layout::TensorNCHW, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::layout::PitchLinearShape<32, 128>, 128, cutlass::layout::PitchLinearShape<8, 4>, 4>, cutlass::AlignedArray<ElementA, 4, 16>>>, SmemIteratorA_=cutlass::transform::threadblock::RegularTileIterator<cutlass::MatrixShape<128, 32>, ElementA, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<32, 32>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::layout::PitchLinearShape<32, 128>, 128, cutlass::layout::PitchLinearShape<8, 4>, 4>, 16>, IteratorB_=cutlass::conv::threadblock::TileIterator<cutlass::conv::threadblock::Conv2dFpropFilterTileAccessIteratorAnalytic<cutlass::MatrixShape<32, 128>, ElementB, cutlass::layout::TensorNCHW, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::layout::PitchLinearShape<32, 128>, 128, cutlass::layout::PitchLinearShape<8, 4>, 4>, cutlass::AlignedArray<ElementA, 4, 16>>>, SmemIteratorB_=cutlass::transform::threadblock::RegularTileIterator<cutlass::MatrixShape<32, 128>, ElementB, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<32, 32>, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::layout::PitchLinearShape<32, 128>, 128, cutlass::layout::PitchLinearShape<8, 4>, 4>, 16>, ElementC_=ElementC, LayoutC_=cutlass::layout::TensorNCHW, Policy_=cutlass::gemm::threadblock::MmaPolicy<cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, float, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 8>, 32, cutlass::tfloat32_t, cutlass::layout::RowMajor, cutlass::tfloat32_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, __nv_bool>, cutlass::MatrixShape<0, 0>, 
cutlass::MatrixShape<0, 0>, 1>, TransformA_=cutlass::NumericArrayConverter<ElementA, ElementA, 32, cutlass::FloatRoundStyle::round_to_nearest, cutlass::transform::thread::UnaryTransform::Identity>, TransformB_=cutlass::NumericArrayConverter<ElementA, ElementA, 32, cutlass::FloatRoundStyle::round_to_nearest, cutlass::transform::thread::UnaryTransform::Identity>, Enable=__nv_bool]"
Here is my definition of the arguments:
typename Conv2dFprop::Arguments arguments{{
  problem_size,
  tensor_a,  // <- reference to tensor A on device
  tensor_b,  // <- reference to tensor B on device
  // tensor C is treated as the bias vector. We can enable the CONV
  // to project away the N, H, W dimension by setting the stride to zero.
  {{tensor_bias.data(), LayoutC::Stride(0)}},
  tensor_d,  // <- reference to tensor D on device
  {{alpha, beta}}
}};
I also changed the epilogue's operator() as follows:
void operator()(FragmentOutput &frag_Z, FragmentOutput &, FragmentAccumulator const &AB,
                FragmentC const &residual,
                FragmentCompute const &bias) const {
  // UnaryOp unary_op;
  // BinaryOp binary_op;
  ActivationOp activation;
  FragmentCompute tmp_Accum =
      NumericArrayConverter<ElementCompute, ElementAccumulator, kElementsPerAccess>()(AB);
  FragmentCompute tmp_residual =
      NumericArrayConverter<ElementCompute, ElementC, kElementsPerAccess>()(residual);
  FragmentCompute z = activation(alpha_ * tmp_Accum + bias);
  // FragmentCompute result_Z = skip_elementwise_ ? z : unary_op(z);
  FragmentCompute result_Z = z;
  NumericArrayConverter<ElementOutput, ElementCompute, kElementsPerAccess> convert_z;
  frag_Z = convert_z(result_Z);
}
I removed the UnaryOp and BinaryOp from the class since they are not used. In fact, I don't use tmp_residual either. Are there any probable errors? Thank you.
Are you using fp32 for inference?
Your error is that it cannot find a corresponding operator() for cutlass::gemm::warp::MmaTensorOp. Your top-level template configuration should be something like this:
cutlass::conv::kernel::DefaultConv2dFprop<
  float,
  cutlass::layout::TensorNHWC,
  float,
  cutlass::layout::TensorNHWC,
  float,
  cutlass::layout::TensorNHWC,
  float,
  cutlass::arch::OpClassTensorOp,
  cutlass::arch::Sm80,
  cutlass::gemm::GemmShape<64, 64, 32>,
  cutlass::gemm::GemmShape<32, 32, 32>,
  cutlass::gemm::GemmShape<16, 8, 8>,
  cutlass::epilogue::thread::LinearCombination<
    float,
    4,
    float,
    float
  >,
  cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<4>,  // cutlass::gemm::threadblock::GemmSplitKIdentityThreadblockSwizzle<>,
  5,
  cutlass::arch::OpMultiplyAdd,
  cutlass::conv::IteratorAlgorithm::kOptimized,
  cutlass::conv::StrideSupport::kStrided,
  4,
  4
>::Kernel;
Thank you, but after I changed the type definition as you instructed, it still tells me that ImplicitGemm is an incomplete type. Here are all my definitions:
#include <cutlass/cutlass.h>
#include <cutlass/array.h>
#include <cutlass/epilogue/thread/linear_combination_bias_elementwise.h>
#include <cutlass/epilogue/thread/linear_combination_residual_block.h>
#include <cutlass/epilogue/thread/activation.h>
#include <cutlass/conv/kernel/default_conv2d_fprop_with_broadcast.h>
#include <cutlass/conv/device/implicit_gemm_convolution.h>
#define CUTLASS_CHECK(status) \
{ \
cutlass::Status error = status; \
if (error != cutlass::Status::kSuccess) { \
std::cerr << "Got cutlass error: " \
<< cutlassGetStatusString(error) << " at: " << __LINE__ \
<< std::endl; \
exit(EXIT_FAILURE); \
} \
}
using ElementA = float;
using ElementB = float;
using ElementC = float;
using LayoutA = cutlass::layout::TensorNCHW;
using LayoutB = cutlass::layout::TensorNCHW;
using LayoutC = cutlass::layout::TensorNCHW;
using ElementD = ElementC;
using ElementAccumulator = ElementC;
using ElementCompute = ElementAccumulator;
// using ActivationOp = cutlass::epilogue::thread::ReLu;
using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombinationBNBlock<
  ElementD,
  ElementAccumulator,
  ElementCompute,
  ElementC,
  4,
  cutlass::epilogue::thread::ReLu
>;

using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
  float,
  cutlass::layout::TensorNCHW,
  float,
  cutlass::layout::TensorNCHW,
  float,
  cutlass::layout::TensorNCHW,
  float,
  cutlass::arch::OpClassTensorOp,
  cutlass::arch::Sm80,
  cutlass::gemm::GemmShape<64, 64, 32>,
  cutlass::gemm::GemmShape<32, 32, 32>,
  cutlass::gemm::GemmShape<16, 8, 8>,
  EpilogueOutputOp,
  cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<4>,
  4,
  cutlass::arch::OpMultiplyAdd,
  cutlass::conv::IteratorAlgorithm::kOptimized,
  cutlass::conv::StrideSupport::kStrided,
  4,
  4
>::Kernel;
using Conv2dFprop = cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;
And here is my instantiation:
ElementCompute alpha = {w};
ElementCompute beta = 0.;
cutlass::conv::Conv2dProblemSize problem_size(
{{x_shape0, x_shape1, x_shape2, x_shape3}}, // activation
{{{in_channel}, {out_channel}, {kernel_size[0]}, {kernel_size[1]}}}, // filter
{{{padding}, {padding}, {padding}, {padding}}}, // padding
{{{stride}, {stride}}}, // striding
{{{dilation}, {dilation}}}, // dilation
cutlass::conv::Mode::kCrossCorrelation, // mode (convolution or cross-correlation)
1 // split-k slices
);
cutlass::TensorRef<ElementA, LayoutA> tensor_a((ElementA*)x_p,
LayoutA().packed({{x_shape0, x_shape1, x_shape2, x_shape3}}));
cutlass::TensorRef<ElementB, LayoutB> tensor_b((ElementB*)weight_p,
LayoutB().packed({{{in_channel}, {out_channel}, {kernel_size[0]}, {kernel_size[1]}}}));
cutlass::TensorRef<ElementC, LayoutC> tensor_d((ElementC*)output_p, LayoutC().packed({{output_shape0, output_shape1, output_shape2, output_shape3}}));
cutlass::TensorRef<ElementC, LayoutC> tensor_bias((ElementC*)bnbias_p, LayoutC().packed({{output_shape0, output_shape1, output_shape2, output_shape3}}));
// cutlass::TensorRef<ElementC, LayoutC> tensor_bias = {{nullptr, Convolution::LayoutDst()}};
typename ImplicitGemm::Arguments arguments{{
  problem_size,
  tensor_a.device_ref(),  // <- reference to tensor A on device
  tensor_b.device_ref(),  // <- reference to tensor B on device
  // tensor C is treated as the bias vector. We can enable the CONV
  // to project away the N, H, W dimension by setting the stride to zero.
  tensor_bias.device_data(), LayoutC::Stride(0),
  tensor_d.device_ref(),  // <- reference to tensor D on device
  {{alpha, beta}}
}};
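For reference, once the kernel type and arguments above are defined, the usual device-level launch sequence looks roughly like the sketch below. It assumes ImplicitGemm is the ImplicitGemmConvolution alias defined earlier and that cutlass/util/device_memory.h is included for the workspace allocation:

ImplicitGemm implicit_gemm_op;

// Query and allocate the workspace (empty when split_k_slices == 1).
size_t workspace_size = implicit_gemm_op.get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

// Verify the problem is supported, then initialize and launch the kernel.
CUTLASS_CHECK(implicit_gemm_op.can_implement(arguments));
CUTLASS_CHECK(implicit_gemm_op.initialize(arguments, workspace.get()));
CUTLASS_CHECK(implicit_gemm_op());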
You are using NCHW, not NHWC.
So I can't use NCHW layout data in the conv computation?
It is most efficient to use NHWC on GPUs. You can convert NCHW to NHWC by using our utility: https://github.com/NVIDIA/cutlass/blob/master/tools/util/include/cutlass/util/device_nchw_to_nhwc.h
Thank you, I'll take a look.
Alright, now I can get the conv with the normal LinearCombination epilogue to work. But when I switch to other epilogues like LinearCombinationBiasRelu, it still produces errors. The only difference is that I replaced the epilogue with LinearCombinationBiasRelu as below:
using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
  float,
  cutlass::layout::TensorNHWC,
  float,
  cutlass::layout::TensorNHWC,
  float,
  cutlass::layout::TensorNHWC,
  float,
  cutlass::arch::OpClassTensorOp,
  cutlass::arch::Sm80,
  cutlass::gemm::GemmShape<64, 64, 32>,
  cutlass::gemm::GemmShape<32, 32, 32>,
  cutlass::gemm::GemmShape<16, 8, 8>,
  cutlass::epilogue::thread::LinearCombinationBiasRelu<
    float,
    float,
    float,
    float,
    4
  >,
  cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<4>,
  4,
  cutlass::arch::OpMultiplyAdd,
  cutlass::conv::IteratorAlgorithm::kOptimized,
  cutlass::conv::StrideSupport::kStrided,
  4,
  4
>::Kernel;
The arguments of the op are as below:
typename Conv2dFprop::Arguments arguments{{
  problem_size,
  tensor_a,  // <- reference to tensor A on device
  tensor_b,  // <- reference to tensor B on device
  // tensor C is treated as the bias vector. We can enable the CONV
  // to project away the N, H, W dimension by setting the stride to zero.
  {{tensor_bias.data(), LayoutC::Stride(0)}},
  tensor_d,  // <- reference to tensor D on device
  {{alpha, beta}}
}};
It just says the arguments do not match.
Can you paste the error?
cutlass/cutlass/include/cutlass/epilogue/threadblock/epilogue.h(500): error: no instance of overloaded function "cutlass::epilogue::thread::LinearCombinationBiasRelu<ElementC_, ElementAccumulator_, ElementCompute_, ElementZ_, ElementsPerAccess, StoreT>::operator() [with ElementC_=float, ElementAccumulator_=float, ElementCompute_=float, ElementZ_=float, ElementsPerAccess=4, StoreT=true]" matches the argument list
argument types are: (const cutlass::Array<float, 4, true>)
object type is: const cutlass::epilogue::thread::LinearCombinationBiasRelu<float, float, float, float, 4, true>
detected during:
instantiation of "void cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::apply_output_operator_source_not_needed_(OutputTileIterator_::Fragment &, const cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::OutputOp &, const SharedLoadIterator_::Fragment &) [with Shape_=cutlass::gemm::GemmShape<64, 64, 32>, WarpMmaOperator_=cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, float, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 8>, 32, cutlass::tfloat32_t, cutlass::layout::RowMajor, cutlass::tfloat32_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, __nv_bool>, PartitionsK=1, OutputTileIterator_=cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<64, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 4, 1, 1, 4>, 128, 4, 32>, float, false, false>, AccumulatorFragmentIterator_=cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, cutlass::gemm::GemmShape<16, 8, 8>, float, cutlass::Array<float, 4, true>, cutlass::layout::RowMajor, cutlass::layout::RowMajor>, WarpTileIterator_=cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, cutlass::gemm::GemmShape<16, 8, 8>, float, cutlass::layout::RowMajor>, SharedLoadIterator_=cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<64, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 4, 1, 1, 4>, 128, 4, 32>::CompactedThreadMap, float, 16>, OutputOp_=cutlass::epilogue::thread::LinearCombinationBiasRelu<float, float, float, float, 4, true>, Padding_=cutlass::MatrixShape<0, 8>, FragmentsPerPartition=2, IterationsUnroll=1]"
(322): here
instantiation of "void cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::compute_source_not_needed_(const cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::OutputOp &, cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::OutputTileIterator, const cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::AccumulatorTile &) [with Shape_=cutlass::gemm::GemmShape<64, 64, 32>, WarpMmaOperator_=cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, float, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 8>, 32, cutlass::tfloat32_t, cutlass::layout::RowMajor, cutlass::tfloat32_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, __nv_bool>, PartitionsK=1, OutputTileIterator_=cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<64, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 4, 1, 1, 4>, 128, 4, 32>, float, false, false>, AccumulatorFragmentIterator_=cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, cutlass::gemm::GemmShape<16, 8, 8>, float, cutlass::Array<float, 4, true>, cutlass::layout::RowMajor, cutlass::layout::RowMajor>, WarpTileIterator_=cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, cutlass::gemm::GemmShape<16, 8, 8>, float, cutlass::layout::RowMajor>, SharedLoadIterator_=cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<64, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 4, 1, 1, 4>, 128, 4, 32>::CompactedThreadMap, float, 16>, OutputOp_=cutlass::epilogue::thread::LinearCombinationBiasRelu<float, float, float, float, 4, true>, Padding_=cutlass::MatrixShape<0, 8>, FragmentsPerPartition=2, IterationsUnroll=1]"
(196): here
instantiation of "void cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::operator()(const cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::OutputOp &, cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::OutputTileIterator, const cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::AccumulatorTile &, cutlass::epilogue::threadblock::Epilogue<Shape_, WarpMmaOperator_, PartitionsK, OutputTileIterator_, AccumulatorFragmentIterator_, WarpTileIterator_, SharedLoadIterator_, OutputOp_, Padding_, FragmentsPerPartition, IterationsUnroll>::OutputTileIterator) [with Shape_=cutlass::gemm::GemmShape<64, 64, 32>, WarpMmaOperator_=cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, float, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<32, 32>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 8>, 32, cutlass::tfloat32_t, cutlass::layout::RowMajor, cutlass::tfloat32_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, __nv_bool>, PartitionsK=1, OutputTileIterator_=cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<64, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 4, 1, 1, 4>, 128, 4, 32>, float, false, false>, AccumulatorFragmentIterator_=cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, cutlass::gemm::GemmShape<16, 8, 8>, float, cutlass::Array<float, 4, true>, cutlass::layout::RowMajor, cutlass::layout::RowMajor>, WarpTileIterator_=cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<32, 32, 32>, cutlass::gemm::GemmShape<16, 8, 8>, float, cutlass::layout::RowMajor>, SharedLoadIterator_=cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<64, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 4, 1, 1, 4>, 128, 4, 32>::CompactedThreadMap, float, 16>, OutputOp_=cutlass::epilogue::thread::LinearCombinationBiasRelu<float, float, float, float, 4, true>, Padding_=cutlass::MatrixShape<0, 8>, FragmentsPerPartition=2, IterationsUnroll=1]"
The error is saying that LinearCombinationBiasRelu does not have an operator() that takes only one parameter, which is true. LinearCombinationBiasRelu requires multiple sources and outputs. You need to use this epilogue: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/epilogue/threadblock/epilogue_with_broadcast.h, which means you need to use DefaultConv2dFpropWithBroadcast. Take a look at this GEMM unit test https://github.com/NVIDIA/cutlass/blob/master/test/unit/gemm/device/gemm_with_broadcast_f16n_f16n_f16n_tensorop_f32_sm75.cu#L254-L286 and this conv unit test https://github.com/NVIDIA/cutlass/blob/master/test/unit/conv/device/conv2d_fprop_with_broadcast_sm75.cu#L48-L90 and their associated testbeds.
I'll try it out, thanks.
Well, I passed compilation, but at runtime it hits CUTLASS internal error 309. I've changed the epilogue to the below:
using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombinationBiasElementwise<
  float,
  float,
  float,
  float,
  float,
  4,
  cutlass::epilogue::thread::ReLu<float>
>;

using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFpropWithBroadcast<
  ElementA, cutlass::layout::TensorNHWC,
  ElementB, cutlass::layout::TensorNHWC,
  ElementC, cutlass::layout::TensorNHWC,
  ElementAccumulator,
  cutlass::arch::OpClassTensorOp,
  cutlass::arch::Sm75,
  cutlass::gemm::GemmShape<128, 128, 32>,
  cutlass::gemm::GemmShape<64, 64, 32>,
  cutlass::gemm::GemmShape<16, 8, 8>,
  EpilogueOutputOp,
  cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
  4,
  cutlass::arch::OpMultiplyAdd,
  cutlass::conv::IteratorAlgorithm::kAnalytic
>::Kernel;
I checked the testbed you mentioned and found out that my arguments may be wrong. Could you give me some hints?
typename Conv2dFprop::Arguments arguments{{
  problem_size,
  tensor_a,  // <- reference to tensor A on device
  tensor_b,  // <- reference to tensor B on device
  // tensor C is treated as the bias vector. We can enable the CONV
  // to project away the N, H, W dimension by setting the stride to zero.
  {{tensor_bias.data(), LayoutC::Stride(0)}},
  tensor_d,  // <- reference to tensor D on device
  {{alpha, beta}}
}};
Maybe the input size and kernel size are not right? My kernel size is (512,3,3,512) and input size is (1,7,7,512). The error happens in the line that initializes the op:
status = implicit_gemm_op.initialize(arguments, workspace.get());
If you use DefaultConv2dFpropWithBroadcast, the constructor of Arguments is like https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/conv/kernel/implicit_gemm_convolution_with_fused_epilogue.h#L178-L190. Right now you do not provide ptr_Vector and ptr_Tensor. ptr_Tensor is used to load an additional vector. Is that what you want?
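If you do go through the broadcast path, the arguments need to supply those pointers explicitly. A rough sketch is below; the exact parameter order, the position of the split-K mode, and the trailing stride values are assumptions to verify against the constructor linked above, and tensor_c here is a hypothetical residual source:

// Hypothetical sketch only; check the ordering against
// implicit_gemm_convolution_with_fused_epilogue.h before using it.
typename Conv2dFprop::Arguments arguments(
  problem_size,
  tensor_a,                   // A (activation)
  tensor_b,                   // B (filter)
  tensor_c,                   // C (source operand)
  tensor_d,                   // D (output)
  {alpha, beta},              // epilogue scalars
  cutlass::conv::SplitKMode::kSerial,
  tensor_bias.data(),         // ptr_Vector: per-channel broadcast vector
  nullptr,                    // ptr_Tensor: optional auxiliary output (unused here)
  0,                          // assumed stride for the broadcast vector
  0);                         // assumed stride for the auxiliary tensor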
Are those two tensors necessary? I set them to nullptr and it still fails to initialize the arguments.
In fact, my purpose is simply to add an epilogue with a linear combination plus ReLU operation after the implicit GEMM AX+B. I wonder, is there a simpler way?
Then you just need to use https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/numeric_conversion.h#L66-L70 with DefaultConv2dFprop.
I don't know if I understand it correctly; should I add this conversion object to the LinearCombinationBiasRelu epilogue?
Sorry, I pasted the wrong link. On top of https://github.com/NVIDIA/cutlass/issues/509#issuecomment-1157222714, change LinearCombinationBiasRelu to LinearCombinationRelu.
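A hedged sketch of that swap on top of the earlier DefaultConv2dFprop definition; only the epilogue argument changes, and the template parameter order of LinearCombinationRelu below (output element, vector width, accumulator, compute) should be checked against linear_combination_relu.h:

#include <cutlass/epilogue/thread/linear_combination_relu.h>

// Epilogue computing D = ReLU(alpha * conv(A, B) + beta * C); with the
// stride=0 trick from example 17, C behaves as a per-channel bias.
using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombinationRelu<
  float,  // output element type
  4,      // elements per vectorized memory access
  float,  // accumulator element type
  float   // compute element type
>;

// Substitute EpilogueOutputOp for LinearCombinationBiasRelu in the
// DefaultConv2dFprop instantiation above; the other arguments stay the same.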
Thank you, I passed the running test. Now I'm working on the nchw_to_nhwc kernel. I found out that Tensor4DCoord's n, h, w, c accessors are designed for the NHWC format, which means c() returns the last dimension. So if I want to change a (1,512,7,7) NCHW input feature to a (1,7,7,512) NHWC one, my code is like below:
cutlass::nchw_to_nhwc<float>({{1, 7, 7, 512}},
{{1, 512, 7, 7}},
tensor_at,
tensor_a,
stream);
Then the result of the same conv is wrong. Is my use of this conversion function right?
https://github.com/NVIDIA/cutlass/blob/master/tools/util/include/cutlass/util/device_nchw_to_nhwc.h#L117-L122
Your code looks wrong to me. The input and output orders are wrong.
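Assuming the utility at the linked lines takes the NCHW input description before the NHWC output one (input extent, output extent, input ref, output ref, stream; an assumption to verify in device_nchw_to_nhwc.h), the corrected call for this case would look roughly like:

// Hedged sketch: convert tensor_a, a (1, 512, 7, 7) NCHW tensor, into
// tensor_at, a (1, 7, 7, 512) NHWC tensor.
cutlass::nchw_to_nhwc<float>({1, 512, 7, 7},   // NCHW input extent
                             {1, 7, 7, 512},   // NHWC output extent
                             tensor_a,         // source (TensorNCHW ref)
                             tensor_at,        // destination (TensorNHWC ref)
                             stream);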
Thanks, I'll try it.