Halide
GPU memory model is poorly documented. copy_to_device does nothing unless you set_host_dirty.
I was following the tutorials, and since my Windows PC doesn't have libJpeg or libPng, I decided to use OpenCV for image I/O. I then learned that in Halide, RGB buffers default to a planar layout, so I checked out test_interleaved.cpp and wrote two small helpers to convert between interleaved and planar buffers. These work fine in the CPU version, as in lesson_07.
However, when I got to lesson_12, the GPU part, the result of the GPU pipeline is all zeros.
More specifically, I tried two strategies:
- Add two Funcs to the original pipeline: one at the beginning that converts the interleaved input to planar, and one at the end that converts back to interleaved.
- Use two additional pipelines. The first converts the interleaved buffer to a planar one, with realize() called, and the planar buffer is then fed to the untouched original pipeline. A second extra pipeline converts the planar result back to an interleaved buffer, from which the corresponding cv::Mat is created.

Both strategies work fine on the CPU. On the GPU, however, the first one (two extra Funcs) is too slow, so I tried the second. Its result is all zeros.
I have tried copy_to_device() and copy_to_host() before and after the realize(), but it seems to make no difference.
The function I use to convert buffers is below:
```cpp
template <class T>
Halide::Buffer<T> interleaved2planar(Halide::Buffer<T> in) {
    Halide::ImageParam src(in.type(), 3);
    Halide::Func plane("plane");
    Halide::Var x, y, c;
    plane(x, y, c) = src(x, y, c);

    // Input is interleaved: x has stride 3, c has stride 1 and extent 3.
    src.dim(0).set_stride(3).dim(2).set_stride(1).set_bounds(0, 3);
    // Output is planar: x has stride 1, c has extent 3.
    plane.output_buffer().dim(0).set_stride(1).dim(2).set_extent(3);

    plane.reorder(c, x, y).unroll(c);
    // plane.vectorize(x, 64);

    src.set(in);
    plane.compile_jit();  // optional; realize() would JIT-compile anyway

    // Halide::Buffer allocates planar storage by default.
    auto out_buf = Halide::Buffer<T>(in.width(), in.height(), 3);
    plane.realize(out_buf);
    return out_buf;
}
```
And this is how I use it:
```cpp
// OpenCV loads images interleaved (and in BGR channel order).
auto mat = cv::imread(data_root + "images/rgb.png");
Buffer<uint8_t> interleaved_in =
    Buffer<uint8_t>::make_interleaved(mat.data, mat.cols, mat.rows, mat.channels());
Buffer<uint8_t> input1 = interleaved2planar(interleaved_in);
```
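For completeness, the reverse helper is essentially the same function with the stride constraints swapped. A sketch (the name planar2interleaved and the schedule are mine, mirroring the code above):
```cpp
template <class T>
Halide::Buffer<T> planar2interleaved(Halide::Buffer<T> in) {
    Halide::ImageParam src(in.type(), 3);
    Halide::Func inter("interleave");
    Halide::Var x, y, c;
    inter(x, y, c) = src(x, y, c);

    // Input is planar: x has stride 1, c has extent 3.
    src.dim(0).set_stride(1).dim(2).set_extent(3);
    // Output is interleaved: x has stride 3, c has stride 1 and extent 3.
    inter.output_buffer().dim(0).set_stride(3).dim(2).set_stride(1).set_bounds(0, 3);

    inter.reorder(c, x, y).unroll(c);
    src.set(in);

    // make_interleaved allocates a buffer with interleaved strides.
    auto out_buf = Halide::Buffer<T>::make_interleaved(in.width(), in.height(), 3);
    inter.realize(out_buf);
    return out_buf;
}
```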
Try set_host_dirty() on your input instead of copy_to_device()
Halide might not realize that the device allocation for interleaved_in doesn't contain the most recent copy of the data. copy_to_device does nothing if the buffer is not marked as being dirty on the host.
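Something like this (a sketch; `gpu_pipeline` stands in for your original GPU-scheduled pipeline):
```cpp
interleaved_in.set_host_dirty();  // host data is newer than any device copy
Buffer<uint8_t> input1 = interleaved2planar(interleaved_in);
input1.set_host_dirty();          // likewise for the freshly realized planar buffer

Buffer<uint8_t> output(input1.width(), input1.height(), 3);
gpu_pipeline.realize(output);     // the input's copy_to_device now actually copies
output.copy_to_host();            // bring the device results back before reading them
```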
Thanks, I'll try that.
Where can I find the usage of methods like set_host_dirty() and set_device_dirty()? I can't find anything in the comments/documentation.
Reopening as a missing-documentation bug.
Part of the problem here is that non-GPU Halide pipelines don't bother to set host dirty on the outputs. They should (it's cheap!)
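Until that's fixed, you can set the flag yourself after the CPU realize() in the conversion helper (a sketch against the code above):
```cpp
plane.realize(out_buf);
out_buf.set_host_dirty();  // CPU pipelines don't set this; without it, a later
                           // copy_to_device() is a no-op and the GPU reads zeros
return out_buf;
```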
I just spent two weeks trying to figure this out. I finally ended up here after stumbling through the Gitter chat and this Stack Overflow post. I agree the documentation on the GPU memory model should be improved; I had no idea that I had to do this.
Minor update: non-GPU pipelines now both set host dirty on the outputs and assert !device_dirty on the inputs.
See also #4600 for an instance of this issue.
Sorry to bump a nearly three-year-old thread, but I was wondering if there has been any progress on documenting the GPU memory model in either the Halide tutorials or the READMEs in the repo?
No, and it's really inexcusable on our part. We need to get someone to step up and make this happen.