cutlass [QST] Datatype conversion for tensor on GPU

[QST] Datatype conversion for tensor on GPU

Open nolyn opened this issue 2 years ago • 3 comments

I'm trying to convert the data type of a tensor on GPU, I think this should be faster than on CPU. Also I'll need to do it several time between different convolutions and syncing to CPU and doing the conversion there would harm performance a lot.

I had a look at #184 and am trying to implement it as outlined there. My use case is conversion of int32_t to int8_t. So far I managed to do it for a number of elements known at compile time:

namespace test {
namespace core {
namespace kernel {
template <typename Destination, typename Source, int Count>
__global__ void convert(
  cutlass::Array<Destination, Count> *destination, 
  cutlass::Array<Source, Count> const *source) {

  cutlass::NumericArrayConverter<Destination, Source, Count> convert;

  *destination = convert(*source);
}

} // namespace kernel
} // namespace core
} // namespace test

template <
  /// Data type of element stored within tensor (concept: NumericType)
  typename ElementIn_,
  /// Defines a mapping from logical coordinate to linear memory (concept: Layout)
  typename LayoutIn_,
  /// Data type of element stored within tensor (concept: NumericType)
  typename ElementOut_,
  /// Defines a mapping from logical coordinate to linear memory (concept: Layout)
  typename LayoutOut_,

  int kN
>
cutlass::HostTensor<ElementOut_, LayoutOut_> convertDataType(cutlass::HostTensor<ElementIn_, LayoutIn_> source)
{
  cutlass::HostTensor<ElementOut_, LayoutOut_> destination { source.extent()};
  // auto shape = source.
  assert( kN == source.size());

  dim3 grid(1, 1);
  dim3 block(1, 1);

  source.sync_device();

  test::core::kernel::convert<ElementOut_, ElementIn_, kN><<< grid, block >>>(
    reinterpret_cast<cutlass::Array<ElementOut_, kN> *>(destination.device_data()),
    reinterpret_cast<cutlass::Array<ElementIn_, kN> const *>(source.device_data())
  );

  destination.sync_host();

  return destination;
  
}

Now, I'm trying to make it work for arbitrary number of elements, I believe this is where iterators are required? I'm trying to make it work using example 04 - tile iterator as a starting point:

template <typename Iterator>
__global__ void converter2(
    typename Iterator::Params dst_params,
    typename Iterator::Element *dst_pointer,
    typename Iterator::Params src_params,
    typename Iterator::Element *src_pointer,
    cutlass::Coord<2> extent) {

    Iterator dst_iterator(dst_params, dst_pointer, extent, threadIdx.x);
    Iterator src_iterator(src_params, src_pointer, extent, threadIdx.x);

    cutlass::NumericArrayConverter<
    int32_t, 
    int8_t, 
    Iterator::Fragment::kElements> converter;

    // PredicatedTileIterator uses PitchLinear layout and therefore takes in a PitchLinearShape.
    // The contiguous dimension can be accessed via Iterator::Shape::kContiguous and the strided
    // dimension can be accessed via Iterator::Shape::kStrided
    int iterations = (extent[1] + Iterator::Shape::kStrided - 1) / Iterator::Shape::kStrided;

    typename Iterator::Fragment fragment;

    for(int i = 0; i < fragment.size(); ++i) {
      fragment[i] = 0;
    }

    src_iterator.load(fragment);
    auto tmp = converter(fragment);
    dst_iterator.store(tmp);

    ++src_iterator;
    ++dst_iterator;

    for(; iterations > 1; --iterations) {

      src_iterator.load(fragment);
      dst_iterator.store(fragment);

      ++src_iterator;
      ++dst_iterator;
    }
}

cutlass::HostTensor<int8_t, LayoutInput> TestTileIteratorNumericConversion(cutlass::HostTensor<int32_t, LayoutInput> src_tensor) {
    using ElementIn_ = int32_t;
    using LayoutIn_ = LayoutInput;
    using ElementOut_ = int8_t;
    using LayoutOut_ = LayoutInput;

    // For this example, we chose a <64, 4> tile shape. The PredicateTileIterator expects
    // PitchLinearShape and PitchLinear layout.
    using Shape = cutlass::layout::PitchLinearShape<64, 4>;
    // using Layout = cutlass::layout::PitchLinear;
    using Element = ElementIn_;
    int const kThreads = 32;

    // ThreadMaps define how threads are mapped to a given tile. The PitchLinearStripminedThreadMap
    // stripmines a pitch-linear tile among a given number of threads, first along the contiguous
    // dimension then along the strided dimension.
    using ThreadMap = cutlass::transform::PitchLinearStripminedThreadMap<Shape, kThreads>;

    // Define the PredicateTileIterator, using TileShape, Element, Layout, and ThreadMap types
    using Iterator = cutlass::transform::threadblock::PredicatedTileIterator<
        Shape, Element, LayoutOut_, 1, ThreadMap>;

    auto source_extent = src_tensor.extent();
    cutlass::HostTensor<ElementOut_, LayoutOut_> dst_tensor { src_tensor.extent()};

    dst_tensor.sync_device();
    src_tensor.sync_device();

    typename Iterator::Params dst_params(dst_tensor.layout());
    typename Iterator::Params src_params(src_tensor.layout());

    dim3 block(kThreads, 1);
    dim3 grid(1, 1);

    converter2<Iterator><<< grid, block >>>(
            dst_params,
            dst_tensor.device_data(),
            src_params,
            src_tensor.device_data(),
            source_extent
    );

    cudaError_t result = cudaGetLastError();
    if(result != cudaSuccess) {
      std::cerr << "Error - kernel failed." << std::endl;
      return dst_tensor;
    }

    dst_tensor.sync_host();

    return dst_tensor;
}

Some issues with this, to which I didn't get yet:

I think I'll need 2 iterations, one for each data type
extent passed to converter2 is 4-dimensional for me. not sure how to adjust that.

However, what I'm stuck with before I can get to this is following error:

[build] ../Network.cu(230): error: incomplete type is not allowed
[build] 
[build] ../Network.cu(231): error: incomplete type is not allowed
[build] 
[build] ../Network.cu(238): error: no instance of function template "converter2" matches the argument list
[build]             argument types are: (<error-type>, int8_t *, <error-type>, int32_t *, cutlass::layout::TensorNHWC::TensorCoord)

The first two are thrown for

    typename Iterator::Params dst_params(dst_tensor.layout());
    typename Iterator::Params src_params(src_tensor.layout());

The last is caused by that too I believe, its thrown when dst_params/src_params is used.

As you can see, I removed some of the templated code. I was trying to narrow this down, but without success. If I can get it to work I'll try to add the templating back. Any suggestions? Thanks in advance!

Jun 13 '22 09:06 nolyn

I managed to do a conversion using something similar to the code below. Putting it here, so it may help others. It might be trivial if you have some experience with cuda, which I do not. I used device_nchw_to_nhwc.h as guide to do this. No idea if it performs optimal, but it certainly beats syncing to host for conversion.


template <typename T>
void convert_on_device(cutlass::Tensor4DCoord input_tensor_size,
                  cutlass::Tensor4DCoord output_tensor_size,
                  TensorRef<T, layout::TensorNHWC> ref_input,
                  TensorRef<T, layout::TensorNHWC> ref_output,
                  cudaStream_t stream);


template <typename T>
__global__ void convert_on_device_kernel(const int32_t n,
                                    const int32_t h,
                                    const int32_t w,
                                    const int32_t c_in,
                                    const int32_t c_out,
                                    const T zero,
                                    const T *input,
                                    T *output){

  const int32_t idx_jump       = blockDim.x * gridDim.x;
  const int32_t total_elements = n * h * w * c_out;


  T value;
  for (int32_t idx = blockIdx.x * blockDim.x + threadIdx.x; idx < total_elements; idx += idx_jump) {
    output[idx] = static_cast<T>(  input[idx] );
  }
}


template <typename T>
void convert_on_device(cutlass::Tensor4DCoord input_tensor_size,
                  cutlass::Tensor4DCoord output_tensor_size,
                  TensorRef<T, layout::TensorNHWC> ref_input,
                  TensorRef<T, layout::TensorNHWC> ref_output,
                  cudaStream_t stream){
  assert(
    input_tensor_size.n() == output_tensor_size.n() &&
    input_tensor_size.h() == output_tensor_size.h() &&
    input_tensor_size.w() == output_tensor_size.w() &&
    input_tensor_size.c() <= output_tensor_size.c()); 
    
  int n = input_tensor_size.n();
  int h = input_tensor_size.h();
  int w = input_tensor_size.w();
  int c_in = input_tensor_size.c();
  int c_out = output_tensor_size.c();
    
  int32_t total_elements = n * h * w * c_out;
  int block_size = 256;
  dim3 grid((total_elements + 255)/256);
  dim3 block(block_size);
  const T zero = static_cast<T>(0.0f);
  convert_on_device_kernel<<<grid, block, 0, stream>>>(n, h, w, c_in, c_out, zero, ref_input.data(), ref_output.data());

}

Jun 23 '22 09:06 nolyn

I'm now trying to do something similar where I have tensors of different shape (one is basically a vector). Is it possible to broadcast it? My first attempt of a kernel, where i simply calculate the index in the vector based on the index in the other tensor fails, I suspect it does not work since the kernel will process a whole block at once. I think the index calculation itself would be ok, I check with similar code on host.

Jun 23 '22 09:06 nolyn

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Jul 23 '22 11:07 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Dec 16 '22 13:12 github-actions[bot]

Closing due to inactivity. Please reopen if needed.

Apr 27 '23 14:04 mnicely

cutlass cutlass copied to clipboard

[QST] Datatype conversion for tensor on GPU

cutlass
cutlass copied to clipboard