Ye Luo

Results: 357 comments by Ye Luo

@carlobertolli I got confused. Were you **not** asking for performance tests to show the benefit of your optimization prototype?

Have you seen a huge reduction in the trace timeline with the following example?

```
int a;
#pragma omp target map(tofrom: a)
{ a = a * 2; }
```

Since the example...
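For anyone who wants to reproduce the trace, the region quoted above can be wrapped into a compilable function. This is only a minimal sketch: the helper name `double_on_device` is hypothetical, and the value passed in is arbitrary. Built without offload support, the region simply runs on the host.

```cpp
// Minimal sketch of the one-scalar offload region from the comment above.
// Assumption: `double_on_device` is a hypothetical helper name, not taken
// from any project. With an offload-capable OpenMP compiler the block runs
// on the default device; otherwise it runs on the host.
#include <cassert>

int double_on_device(int a)
{
  // map(tofrom: a) copies `a` to the device before the region starts
  // and copies the updated value back when the region ends.
#pragma omp target map(tofrom: a)
  {
    a = a * 2;
  }
  return a;
}
```

Calling `double_on_device(21)` returns 42 on either path, so any difference between runtimes shows up only in the trace timeline (transfers, kernel launch, synchronization), not in the result.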

@carlobertolli Here is a benchmark from the QMCPACK performance tests that you may try: the `performance-NiO-a64-e768-batched_driver-w16-DU32-1-4` test case. Make a real+mixed precision build. Here is the recipe for `AOMP 14.0_1` ```...

In this offload region https://github.com/ye-luo/miniqmc/blob/c0b6c89746e424f5b198bd63ad63dd5bb5cd12cf/src/QMCWaveFunctions/einspline_spo_omp.cpp#L413 I still see two synchronizations (`signal_wait_scacquire`) being used. ![Screenshot from 2022-02-25 21-51-11](https://user-images.githubusercontent.com/1454251/155828050-02c187e2-23fe-49da-ae53-03b38a03fecd.png) However, with the CUDA plugin, only one synchronization is needed. ![Screenshot from 2022-02-25 21-49-20](https://user-images.githubusercontent.com/1454251/155828062-5e603afe-91e3-40f1-97a7-28009feaef80.png)

I'm thinking of leaving it open here. Once we improve the performance in upstream and propagate the change back to AOMP, then we close it here.

The command line is on the first line of the description. I tried removing the offload options and got the following.

```
$ nm a.out |grep muldc
U __muldc3@@GCC_4.0.0
```

from...

11.7-1 works with default optimization, `-O3`, and `-O3 -ffast-math`. As soon as I add `-g`, the compiler stops.

FYI: From a miniQMC run

```
OMP_NUM_THREADS=8 rocprof --hsa-trace ./bin/check_spo -n 1
```

3916422 is not an OpenMP thread, but it only makes many `hsa_system_get_info` calls during initialization. I'm referring...

> It's not a fix for hipMemset, but you might be interested in hsa_amd_memory_fill as an alternative

Thank you for the info. I don't really use hipMemset in the application...

In the application, I use memory pointers from `omp_target_alloc` to call hipBLAS or HIP kernels. I was expecting `hipMemset` to be a convenient routine that ends up calling a kernel. It...
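To illustrate the idea of a fill that is just a kernel over `omp_target_alloc` memory, here is a hedged sketch written as an OpenMP target loop. The comment above does not propose this approach, and the helper names (`alloc_buffer`, `free_buffer`, `fill_buffer`, `sum_buffer`) are hypothetical. When the code is built without OpenMP, it falls back to plain host memory so that it still compiles and runs.

```cpp
// Sketch: filling memory obtained from omp_target_alloc with a target loop.
// All helper names are hypothetical. Without OpenMP the pragmas are ignored
// and the buffer is ordinary host memory.
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Allocate n doubles on the default device (host memory without OpenMP).
double* alloc_buffer(std::size_t n)
{
#ifdef _OPENMP
  return static_cast<double*>(
      omp_target_alloc(n * sizeof(double), omp_get_default_device()));
#else
  return static_cast<double*>(std::malloc(n * sizeof(double)));
#endif
}

// Release a buffer obtained from alloc_buffer.
void free_buffer(double* p)
{
#ifdef _OPENMP
  omp_target_free(p, omp_get_default_device());
#else
  std::free(p);
#endif
}

// Set every element to `value`; is_device_ptr tells the runtime that `p`
// is already a device address and must not be mapped.
void fill_buffer(double* p, std::size_t n, double value)
{
#pragma omp target teams distribute parallel for is_device_ptr(p)
  for (std::size_t i = 0; i < n; ++i)
    p[i] = value;
}

// Sum the elements on the device and return the result to the host.
double sum_buffer(const double* p, std::size_t n)
{
  double sum = 0.0;
#pragma omp target teams distribute parallel for is_device_ptr(p) map(tofrom: sum) reduction(+ : sum)
  for (std::size_t i = 0; i < n; ++i)
    sum += p[i];
  return sum;
}
```

A typical use would be `double* p = alloc_buffer(n); fill_buffer(p, n, 0.0);` before handing `p` to a library call, followed by `free_buffer(p)` once the buffer is no longer needed. The reduction helper is only there to check the contents without dereferencing a device pointer on the host.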