
Android AOT Vulkan GPU Performance

Open helloworldstone opened this issue 2 years ago • 4 comments

Hi, everyone!

I wrote an AOT Vulkan demo on Android following https://github.com/taichi-dev/taichi-aot-demo/blob/master/implicit_fem/include/fem_app.h, and compiled a binary executable with CMake's "add_executable".

When I run the executable, I find that the Vulkan "kernel" performance is much worse than OpenCL's. The "kernel" running time is measured as follows (running 100 times and taking the average): vulkan_time = launch + synchronize, opencl_time = clEnqueueNDRangeKernel + clFinish.
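For reference, the OpenCL measurement corresponds roughly to the sketch below (the queue, kernel, and global work size are assumed to be set up elsewhere; names are illustrative and error handling is omitted):

#include <CL/cl.h>
#include <sys/time.h>

// Average per-launch time in ms over `iterations` enqueue + finish pairs.
double time_cl_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                         const size_t global_size[2], int iterations) {
    struct timeval start, end;
    gettimeofday(&start, nullptr);
    for (int i = 0; i < iterations; i++) {
        // One 2D launch plus a full host wait, mirroring launch() + synchronize() on the Vulkan side.
        clEnqueueNDRangeKernel(queue, kernel, 2, nullptr, global_size, nullptr,
                               0, nullptr, nullptr);
        clFinish(queue);
    }
    gettimeofday(&end, nullptr);
    double usec = 1e6 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    return usec / 1000.0 / iterations;
}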

I found that one of the reasons is the GPU frequency: while the Vulkan demo runs, the GPU clock sits at half of its maximum frequency. So,

  1. Is there a Taichi C++ API to boost the GPU frequency?
  2. Besides the GPU frequency, is there anything else I should do to improve Vulkan "kernel" performance?
  3. Do you have a plan to support OpenCL?

Thanks a lot! Looking forward to your reply!

helloworldstone avatar Aug 09 '22 03:08 helloworldstone

Hey @helloworldstone, wow, it's pretty impressive to see you got it up and running! We haven't done any benchmarks against OpenCL yet. Would you mind sharing your benchmark repo so that we can give it a try as well? Thanks a lot! cc: @turbo0628, who might be interested :D

ailzhang avatar Aug 09 '22 03:08 ailzhang

Hi @ailzhang, sorry about that, I don't have a benchmark repo. I've pasted the code below.

On a Snapdragon 888, the OpenCL kernel costs 2.21 ms and the Vulkan kernel costs 8.02 ms.

C++:

int main() {
const std::vector<std::string> extensions = {
    VK_KHR_SURFACE_EXTENSION_NAME,
    VK_KHR_ANDROID_SURFACE_EXTENSION_NAME,
    VK_KHR_GET_PHYSICAL_DEVICE_PROPERTIES_2_EXTENSION_NAME,
};

// Create a Vulkan device
taichi::lang::vulkan::VulkanDeviceCreator::Params evd_params;
evd_params.api_version = VK_API_VERSION_1_2;
evd_params.additional_instance_extensions = extensions;
evd_params.additional_device_extensions = {VK_KHR_SWAPCHAIN_EXTENSION_NAME};
evd_params.is_for_ui = false;
evd_params.surface_creator = nullptr;
std::unique_ptr<taichi::lang::vulkan::VulkanDeviceCreator> embedded_device_ =
    std::make_unique<taichi::lang::vulkan::VulkanDeviceCreator>(evd_params);

taichi::lang::vulkan::VulkanDevice* device_ = static_cast<taichi::lang::vulkan::VulkanDevice*>(
    embedded_device_->device());

// Initialize our Vulkan Program pipeline
std::vector<uint64_t> host_result_buffer_;
host_result_buffer_.resize(taichi_result_buffer_entries);
taichi::lang::gfx::GfxRuntime::Params params;
params.host_result_buffer = host_result_buffer_.data();
params.device = embedded_device_->device();
std::unique_ptr<taichi::lang::gfx::GfxRuntime> vulkan_runtime_ =
    std::make_unique<taichi::lang::gfx::GfxRuntime>(std::move(params));

// Get kernels
std::string shader_source = "/data/local/tmp/test_demo/aot";
taichi::lang::gfx::AotModuleParams aot_params{shader_source, vulkan_runtime_.get()};
std::unique_ptr<taichi::lang::aot::Module> module_ = taichi::lang::aot::Module::load(taichi::Arch::vulkan, aot_params);
auto root_size = module_->get_root_size();
vulkan_runtime_->add_root_buffer(root_size);
taichi::lang::aot::Kernel* fast_gauss_kernel = module_->get_kernel("fast_gauss");

const int imgW = 1080;
const int imgH = 1920;
// Prepare Ndarray for model
taichi::lang::Device::AllocParams alloc_params;
alloc_params.usage = taichi::lang::AllocUsage::Storage;
alloc_params.host_write = true;
alloc_params.host_read  = false;
alloc_params.size = imgW * imgH * sizeof(float);
taichi::lang::DeviceAllocation devalloc_inImg_  = device_->allocate_memory(alloc_params);

alloc_params.size = 25 * sizeof(float);
taichi::lang::DeviceAllocation devalloc_gaussCoefs  = device_->allocate_memory(alloc_params);

alloc_params.host_write = false;
alloc_params.host_read  = true;
alloc_params.size = imgW * imgH * sizeof(float);
taichi::lang::DeviceAllocation devalloc_outImg_ = device_->allocate_memory(alloc_params);

float coefs[25] = 
{0.00390625, 0.015625, 0.0234375,0.015625,0.00390625,
0.015625,0.0625,0.09375,0.0625,0.015625,
0.0234375,0.09375,0.140625,0.09375,0.0234375,
0.015625,0.0625,0.09375,0.0625,0.015625,
0.00390625,0.015625,0.0234375,0.015625,0.00390625};

float* tmpBuf    = (float*)malloc(imgW * imgH * sizeof(float));
float* tmpBufOut = (float*)malloc(imgW * imgH * sizeof(float));
for(int i =0; i < imgW * imgH; i++) {
    tmpBuf[i] = i % 256;
}

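// Upload the input image and the 5x5 Gaussian coefficients, then wait for the copies to complete.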
load_data(vulkan_runtime_.get(), devalloc_inImg_, tmpBuf, imgW * imgH * sizeof(float));
load_data(vulkan_runtime_.get(), devalloc_gaussCoefs, coefs, 5 * 5 * sizeof(float));
vulkan_runtime_->synchronize();

taichi::lang::RuntimeContext host_ctx_;
memset(&host_ctx_, 0, sizeof(taichi::lang::RuntimeContext));
host_ctx_.result_buffer = host_result_buffer_.data();

float time_use_run; 
struct timeval tpstart_run, tpend_run; 
gettimeofday(&tpstart_run, nullptr);
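// Time 100 launches; each iteration re-binds the ndarray arguments, launches the kernel,
// and fully synchronizes with the host, so a per-launch sync is included in the measurement.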
for(int i =0; i < 100; i++) {
    host_ctx_.set_arg_devalloc(0, devalloc_inImg_, {imgH, imgW});
    host_ctx_.set_arg_devalloc(1, devalloc_outImg_, {imgH, imgW});
    host_ctx_.set_arg_devalloc(2, devalloc_gaussCoefs, {5, 5});
    fast_gauss_kernel->launch(&host_ctx_);
    vulkan_runtime_->synchronize();
}
gettimeofday(&tpend_run, nullptr); 
time_use_run = 1000000 * (tpend_run.tv_sec - tpstart_run.tv_sec) + tpend_run.tv_usec - tpstart_run.tv_usec; 
printf("[vulkan run time] = %40f ms. \n", time_use_run / 1000 / 100);

//debug
read_data(vulkan_runtime_.get(), devalloc_outImg_, tmpBufOut, imgW * imgH * sizeof(float));
vulkan_runtime_->synchronize();
int startIdx = imgH * imgW / 2 + imgW / 2;
for(int i = startIdx ; i < (startIdx + 10); i++) {
    float* tmpPtr  = (float*)tmpBufOut;
    printf("idx  = %d val = %f \n", i, tmpPtr[i]);

}
device_->dealloc_memory(devalloc_inImg_);
device_->dealloc_memory(devalloc_outImg_);
device_->dealloc_memory(devalloc_gaussCoefs);

free(tmpBuf);    tmpBuf = nullptr;
free(tmpBufOut); tmpBufOut = nullptr;
return 0;

}

Python Taichi kernel:

height = 1920
width = 1080
inputImg = ti.ndarray(dtype=ti.f32, shape=(height, width))
outputImg = ti.ndarray(dtype=ti.f32, shape=(height, width))
kernelCoffes = ti.ndarray(dtype=ti.f32, shape=(5, 5))

@ti.kernel
def fast_gauss(inputImg: ti.types.ndarray(), outputImg: ti.types.ndarray(), kernelCoffes: ti.types.ndarray()):
    for pi, pj in ti.ndrange(height - 2, width - 2):
        if pi >= 2 and pj >= 2:
            outputImg[pi, pj] = (inputImg[pi + 0 - 2, pj + 0 - 2] * kernelCoffes[0, 0] +
                                 inputImg[pi + 0 - 2, pj + 1 - 2] * kernelCoffes[0, 1] +
                                 inputImg[pi + 0 - 2, pj + 2 - 2] * kernelCoffes[0, 2] +
                                 inputImg[pi + 0 - 2, pj + 3 - 2] * kernelCoffes[0, 3] +
                                 inputImg[pi + 0 - 2, pj + 4 - 2] * kernelCoffes[0, 4] +
                                 inputImg[pi + 1 - 2, pj + 0 - 2] * kernelCoffes[1, 0] +
                                 inputImg[pi + 1 - 2, pj + 1 - 2] * kernelCoffes[1, 1] +
                                 inputImg[pi + 1 - 2, pj + 2 - 2] * kernelCoffes[1, 2] +
                                 inputImg[pi + 1 - 2, pj + 3 - 2] * kernelCoffes[1, 3] +
                                 inputImg[pi + 1 - 2, pj + 4 - 2] * kernelCoffes[1, 4] +
                                 inputImg[pi + 2 - 2, pj + 0 - 2] * kernelCoffes[2, 0] +
                                 inputImg[pi + 2 - 2, pj + 1 - 2] * kernelCoffes[2, 1] +
                                 inputImg[pi + 2 - 2, pj + 2 - 2] * kernelCoffes[2, 2] +
                                 inputImg[pi + 2 - 2, pj + 3 - 2] * kernelCoffes[2, 3] +
                                 inputImg[pi + 2 - 2, pj + 4 - 2] * kernelCoffes[2, 4] +
                                 inputImg[pi + 3 - 2, pj + 0 - 2] * kernelCoffes[3, 0] +
                                 inputImg[pi + 3 - 2, pj + 1 - 2] * kernelCoffes[3, 1] +
                                 inputImg[pi + 3 - 2, pj + 2 - 2] * kernelCoffes[3, 2] +
                                 inputImg[pi + 3 - 2, pj + 3 - 2] * kernelCoffes[3, 3] +
                                 inputImg[pi + 3 - 2, pj + 4 - 2] * kernelCoffes[3, 4] +
                                 inputImg[pi + 4 - 2, pj + 0 - 2] * kernelCoffes[4, 0] +
                                 inputImg[pi + 4 - 2, pj + 1 - 2] * kernelCoffes[4, 1] +
                                 inputImg[pi + 4 - 2, pj + 2 - 2] * kernelCoffes[4, 2] +
                                 inputImg[pi + 4 - 2, pj + 3 - 2] * kernelCoffes[4, 3] +
                                 inputImg[pi + 4 - 2, pj + 4 - 2] * kernelCoffes[4, 4])

helloworldstone avatar Aug 09 '22 09:08 helloworldstone

For problem 1, it's something we encountered ourselves during testing. On mobile, the OS sometimes just doesn't want to treat our application as interactive or as a game and does not boost performance, and I don't think we can provide an API to pin the frequency (it's not possible).
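(Side note, for monitoring only: on Adreno GPUs the kgsl driver usually exposes the current and maximum clocks through sysfs. The paths below are an assumption and vary by device and kernel version; this lets you observe the clock while the benchmark runs, not pin it.)

#include <fstream>
#include <iostream>
#include <string>

// Read a single integer value from a sysfs node; returns -1 if the node is missing or unreadable.
static long read_sysfs_long(const std::string &path) {
    std::ifstream f(path);
    long value = -1;
    f >> value;
    return value;
}

int main() {
    long cur = read_sysfs_long("/sys/class/kgsl/kgsl-3d0/gpuclk");      // current clock, Hz (assumed path)
    long max = read_sysfs_long("/sys/class/kgsl/kgsl-3d0/max_gpuclk");  // maximum clock, Hz (assumed path)
    std::cout << "GPU clock: " << cur << " / " << max << " Hz" << std::endl;
    return 0;
}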

For 2, in general try to avoid the "synchronize" call; it is a major factor in slowing everything down. Other performance tips vary a lot from program to program. A minimal sketch against the snippet above (same objects; the ndarray arguments only need to be bound once, and the synchronize happens once after the whole batch) is shown below.
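host_ctx_.set_arg_devalloc(0, devalloc_inImg_, {imgH, imgW});
host_ctx_.set_arg_devalloc(1, devalloc_outImg_, {imgH, imgW});
host_ctx_.set_arg_devalloc(2, devalloc_gaussCoefs, {5, 5});

gettimeofday(&tpstart_run, nullptr);
for (int i = 0; i < 100; i++) {
    fast_gauss_kernel->launch(&host_ctx_);  // queue the launch, no host stall
}
vulkan_runtime_->synchronize();             // wait once for the whole batch
gettimeofday(&tpend_run, nullptr);

Note that this measures the throughput of 100 back-to-back launches rather than the latency of a single launch-plus-sync, so the OpenCL comparison should be structured the same way (one clFinish after the whole batch).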

For 3, I'm not sure we have a plan yet. OpenCL is quite "dead" as an API right now, and support from hardware vendors is sparse.

bobcao3 avatar Aug 10 '22 23:08 bobcao3

@bobcao3 OK, thanks a lot!

helloworldstone avatar Aug 11 '22 13:08 helloworldstone