DCA
Host to host copies trigger device synchronization.
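All copies between Vector and Matrix objects, including host-to-host ones, are handled by cudaMemcpy with cudaMemcpyDefault. Because a host-to-host cudaMemcpy also synchronizes with the device, every such copy blocks the CPU, which prevents us from running CPU code in parallel with GPU memory transfers and kernel execution.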
Relevant code: include/dca/linalg/util/copy.hpp
See: https://stackoverflow.com/questions/22430446/does-cuda-memcpy-from-host-to-host-perform-synchronization
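
One possible direction, sketched below, is to branch on where the buffers live and serve host-to-host copies with std::memcpy, so the CUDA runtime is never entered for them. The `MemSpace` tag and the `copyBytes` helper are hypothetical names used only for illustration; they are not taken from copy.hpp, and the actual Vector and Matrix classes may expose the memory location differently. The device path is only compiled under nvcc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Hypothetical tag for where a buffer lives; DCA's own types may differ.
enum class MemSpace { Host, Device };

// Copies `bytes` bytes from src to dst.
// Returns true when the copy was served by std::memcpy, i.e. without any
// CUDA runtime call, and false when it went through cudaMemcpy.
inline bool copyBytes(void* dst, MemSpace dst_space, const void* src,
                      MemSpace src_space, std::size_t bytes) {
  assert(bytes == 0 || (dst != nullptr && src != nullptr));

  // Host-to-host: plain memcpy, no interaction with the device.
  if (dst_space == MemSpace::Host && src_space == MemSpace::Host) {
    std::memcpy(dst, src, bytes);
    return true;
  }

#ifdef __CUDACC__
  // At least one side is on the device: let the runtime infer the direction.
  const cudaError_t err = cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
  if (err != cudaSuccess)
    throw std::runtime_error(cudaGetErrorString(err));
  return false;
#else
  // Built without nvcc: there is no device path in this sketch.
  throw std::runtime_error("device copies require the CUDA runtime");
#endif
}
```

With something along these lines, a CPU Vector to CPU Vector assignment would call `copyBytes(dst, MemSpace::Host, src, MemSpace::Host, n * sizeof(T))` and return without waiting on any outstanding GPU work. Copies that do involve the device would still block in this sketch; moving them to cudaMemcpyAsync on a stream is a separate change.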