aomp
aomp copied to clipboard
[Issue][OMPT] Device-to-device transfers use wrong timestamps in device tracing interface
Problem Description
Just a few days ago, LLVM and the AMD fork included a fix from @mhalk for https://github.com/llvm/llvm-project/issues/66478, which caused data transfers between two devices to be reported as going through the host first. This fix generally works, but the timestamps for those records seem to be broken, causing them to use the timestamps of the last event instead.
This causes some funny results, like copying 4.65 GiB of data in just 9 microseconds, since this was the kernel execution time, resulting in a blazingly fast data rate of 505 TiB/s or copying a few bytes bring reported as taking several minutes.
I think that something broke while changing from the old behavior to the new one, though I haven't looked at the implementation yet to see where the issue could come from.
Operating System
Apptainer -- Ubuntu 22.04.3 LTS on JURECA-DC Evaluation Platform
CPU
2x AMD EPYC 7443 24-Core Processor
GPU
4x AMD Instinct MI250X
ROCm Version
ROCm 6.0.0
ROCm Component
aomp, llvm-project
Steps to Reproduce
Looking at the results with a simple reproducer, one can see the following results:
Init: device_num=0 type=gfx1101 device=0x475b40 lookup=0x7f07e79d28e0 doc=(nil)
Allocating Memory on Device
Callback DataOp EMI: endpoint=1 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000001) optype=1 src=(nil) src_device_num=1 dest=(nil) dest_device_num=0 bytes=4 code=0x20627b
Allocated 256 bytes at 0x3c4180 in buffer request callback
Callback DataOp EMI: endpoint=2 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000001) optype=1 src=(nil) src_device_num=1 dest=0x7f06df420000 dest_device_num=0 bytes=4 code=0x20627b
Testing: Host to Device
Callback DataOp EMI: endpoint=1 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000002) optype=2 src=0x47d3a0 src_device_num=1 dest=0x7f06df420000 dest_device_num=0 bytes=4 code=0x206323
Callback DataOp EMI: endpoint=2 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000002) optype=2 src=0x47d3a0 src_device_num=1 dest=0x7f06df420000 dest_device_num=0 bytes=4 code=0x206323
Testing: Device to Device
Callback DataOp EMI: endpoint=1 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000003) optype=3 src=0x7f06df420000 src_device_num=0 dest=0x7f06df420000 dest_device_num=0 bytes=4 code=0x2063b1
Allocated 256 bytes at 0x47df40 in buffer request callback
Callback DataOp EMI: endpoint=2 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000003) optype=3 src=0x7f06df420000 src_device_num=0 dest=0x7f06df420000 dest_device_num=0 bytes=4 code=0x2063b1
Testing: Device to Host
Callback DataOp EMI: endpoint=1 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000004) optype=3 src=0x7f06df420000 src_device_num=0 dest=0x47d3a0 dest_device_num=1 bytes=4 code=0x20643f
Callback DataOp EMI: endpoint=2 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000004) optype=3 src=0x7f06df420000 src_device_num=0 dest=0x47d3a0 dest_device_num=1 bytes=4 code=0x20643f
Checking Correctness
Freeing Memory on Device
Callback DataOp EMI: endpoint=1 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000005) optype=4 src=0x7f06df420000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x206509
Executing buffer complete callback: 0 0x3c4180 224 0x3c4180 0
rec=0x3c4180 type= 9 time=70992579988768 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000001 optype=1 src_addr=(nil) src_device=1 dest_addr=0x7f06df420000 dest_device=0 bytes=4 end_time=70992579995640 duration=6872 ns codeptr=0x20627b
rec=0x3c41f0 type= 9 time=70992580935310 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000002 optype=2 src_addr=0x47d3a0 src_device=1 dest_addr=0x7f06df420000 dest_device=0 bytes=4 end_time=70992580942030 duration=6720 ns codeptr=0x206323
Allocated 256 bytes at 0x47e240 in buffer request callback
Executing buffer complete callback: 0 0x3c4180 0 (nil) 1
Deallocated 0x3c4180
Callback DataOp EMI: endpoint=2 target_task_data=(nil) (0x0) target_data=0x7f07e803b7b0 (0x0) host_op_id=0x7f07e803b7a8 (0x8000000000000005) optype=4 src=0x7f06df420000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x206509
Executing buffer complete callback: 0 0x47df40 224 0x47df40 0
rec=0x47df40 type= 9 time=70992580935310 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000003 optype=3 src_addr=0x7f06df420000 src_device=0 dest_addr=0x7f06df420000 dest_device=0 bytes=4 end_time=70992580942030 duration=6720 ns codeptr=0x2063b1
rec=0x47dfb0 type= 9 time=70992580981830 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000004 optype=3 src_addr=0x7f06df420000 src_device=0 dest_addr=0x47d3a0 dest_device=1 bytes=4 end_time=70992580984910 duration=3080 ns codeptr=0x20643f
Executing buffer complete callback: 0 0x47df40 0 (nil) 1
Deallocated 0x47df40
Executing buffer complete callback: 0 0x47e240 112 0x47e240 0
rec=0x47e240 type= 9 time=70992580997175 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000005 optype=4 src_addr=0x7f06df420000 src_device=0 dest_addr=(nil) dest_device=-1 bytes=0 end_time=70992580997817 duration=642 ns codeptr=0x206509
All important parts can be found at the end, where the buffer is evaluated. Here, we can the the following:
rec=0x3c4180 type= 9 time=70992579988768 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000001 optype=1 src_addr=(nil) src_device=1 dest_addr=0x7f06df420000 dest_device=0 bytes=4 end_time=70992579995640 duration=6872 ns codeptr=0x20627b
rec=0x3c41f0 type= 9 time=70992580935310 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000002 optype=2 src_addr=0x47d3a0 src_device=1 dest_addr=0x7f06df420000 dest_device=0 bytes=4 end_time=70992580942030 duration=6720 ns codeptr=0x206323
rec=0x47df40 type= 9 time=70992580935310 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000003 optype=3 src_addr=0x7f06df420000 src_device=0 dest_addr=0x7f06df420000 dest_device=0 bytes=4 end_time=70992580942030 duration=6720 ns codeptr=0x2063b1
rec=0x47dfb0 type= 9 time=70992580981830 thread_id=1320323111743406870 target_id=0
Record Target data op: target_id=0x0 host_op_id=0x8000000000000004 optype=3 src_addr=0x7f06df420000 src_device=0 dest_addr=0x47d3a0 dest_device=1 bytes=4 end_time=70992580984910 duration=3080 ns codeptr=0x20643f
Notice how the third data transfer has the exact same time = 70992580935310 and end_time = 70992580942030. This occurs for every single data transfer happening between two devices (not just the same device).
Here's the reproducer:
callbacks.h
Click to expand
#include <assert.h>
#include <memory>
#include <unordered_set>
#define EMI 1
// Tool related code below
#include <omp-tools.h>
ompt_id_t next_op_id = 0x8000000000000001;
#define OMPT_BUFFER_REQUEST_SIZE 256
// Map of devices traced
typedef std::unordered_set<ompt_device_t*> DeviceMap_t;
typedef std::unique_ptr<DeviceMap_t> DeviceMapPtr_t;
extern DeviceMapPtr_t DeviceMapPtr;
// Utilities
static void print_record_ompt(ompt_record_ompt_t *rec) {
if (rec == NULL) return;
printf("rec=%p type=%2d time=%lu thread_id=%lu target_id=%lu\n",
rec, rec->type, rec->time, rec->thread_id, rec->target_id);
switch (rec->type) {
case ompt_callback_target:
case ompt_callback_target_emi:
{
ompt_record_target_t target_rec = rec->record.target;
printf("\tRecord Target task: target_id=0x%lx kind=%d endpoint=%d device=%d task_id=%lu codeptr=%p\n",
target_rec.target_id,
target_rec.kind, target_rec.endpoint, target_rec.device_num,
target_rec.task_id, target_rec.codeptr_ra);
break;
}
case ompt_callback_target_data_op:
case ompt_callback_target_data_op_emi:
{
ompt_record_target_data_op_t target_data_op_rec = rec->record.target_data_op;
printf("\t Record Target data op: target_id=0x%lx host_op_id=0x%lx optype=%d src_addr=%p src_device=%d "
"dest_addr=%p dest_device=%d bytes=%lu end_time=%lu duration=%lu ns codeptr=%p\n",
rec->target_id,
target_data_op_rec.host_op_id, target_data_op_rec.optype,
target_data_op_rec.src_addr, target_data_op_rec.src_device_num,
target_data_op_rec.dest_addr, target_data_op_rec.dest_device_num,
target_data_op_rec.bytes, target_data_op_rec.end_time,
target_data_op_rec.end_time - rec->time,
target_data_op_rec.codeptr_ra);
break;
}
case ompt_callback_target_submit:
case ompt_callback_target_submit_emi:
{
ompt_record_target_kernel_t target_kernel_rec = rec->record.target_kernel;
printf("\t Record Target kernel: target_id=0x%lx host_op_id=0x%lx requested_num_teams=%u granted_num_teams=%u "
"end_time=%lu duration=%lu ns\n",
rec->target_id,
target_kernel_rec.host_op_id, target_kernel_rec.requested_num_teams,
target_kernel_rec.granted_num_teams, target_kernel_rec.end_time,
target_kernel_rec.end_time - rec->time);
break;
}
default:
assert(0);
break;
}
}
static void delete_buffer_ompt(ompt_buffer_t *buffer) {
free(buffer);
printf("Deallocated %p\n", buffer);
}
// OMPT entry point handles
static ompt_set_callback_t ompt_set_callback = 0;
static ompt_set_trace_ompt_t ompt_set_trace_ompt = 0;
static ompt_start_trace_t ompt_start_trace = 0;
static ompt_flush_trace_t ompt_flush_trace = 0;
static ompt_stop_trace_t ompt_stop_trace = 0;
static ompt_get_record_ompt_t ompt_get_record_ompt = 0;
static ompt_advance_buffer_cursor_t ompt_advance_buffer_cursor = 0;
// OMPT callbacks
// Trace record callbacks
static void on_ompt_callback_buffer_request (
int device_num,
ompt_buffer_t **buffer,
size_t *bytes
) {
*bytes = OMPT_BUFFER_REQUEST_SIZE;
*buffer = malloc(*bytes);
printf("Allocated %lu bytes at %p in buffer request callback\n", *bytes, *buffer);
}
static void on_ompt_callback_buffer_complete (
int device_num,
ompt_buffer_t *buffer,
size_t bytes, /* bytes returned in this callback */
ompt_buffer_cursor_t begin,
int buffer_owned
) {
printf("Executing buffer complete callback: %d %p %lu %p %d\n",
device_num, buffer, bytes, (void*)begin, buffer_owned);
int status = 1;
ompt_buffer_cursor_t current = begin;
while (status) {
ompt_record_ompt_t *rec = ompt_get_record_ompt(buffer, current);
print_record_ompt(rec);
status = ompt_advance_buffer_cursor(NULL, /* TODO device */
buffer,
bytes,
current,
¤t);
}
if (buffer_owned) delete_buffer_ompt(buffer);
}
static ompt_set_result_t set_trace_ompt() {
if (!ompt_set_trace_ompt) return ompt_set_error;
#if EMI
ompt_set_trace_ompt(0, 1, ompt_callback_target_emi);
ompt_set_trace_ompt(0, 1, ompt_callback_target_data_op_emi);
ompt_set_trace_ompt(0, 1, ompt_callback_target_submit_emi);
#else
ompt_set_trace_ompt(0, 1, ompt_callback_target);
ompt_set_trace_ompt(0, 1, ompt_callback_target_data_op);
ompt_set_trace_ompt(0, 1, ompt_callback_target_submit);
#endif
return ompt_set_always;
}
static int start_trace(int device_num, ompt_device_t *Device) {
if (!ompt_start_trace) return 0;
// This device will be traced.
assert(DeviceMapPtr->find(Device) == DeviceMapPtr->end() &&
"Device already present in the map");
DeviceMapPtr->insert(Device);
return ompt_start_trace(Device, &on_ompt_callback_buffer_request,
&on_ompt_callback_buffer_complete);
}
static int flush_trace(ompt_device_t *Device) {
if (!ompt_flush_trace) return 0;
return ompt_flush_trace(Device);
}
static int stop_trace(ompt_device_t *Device) {
if (!ompt_stop_trace) return 0;
return ompt_stop_trace(Device);
}
// Synchronous callbacks
static void on_ompt_callback_device_initialize
(
int device_num,
const char *type,
ompt_device_t *device,
ompt_function_lookup_t lookup,
const char *documentation
) {
printf("Init: device_num=%d type=%s device=%p lookup=%p doc=%p\n",
device_num, type, device, lookup, documentation);
if (!lookup) {
printf("Trace collection disabled on device %d\n", device_num);
return;
}
ompt_set_trace_ompt = (ompt_set_trace_ompt_t) lookup("ompt_set_trace_ompt");
ompt_start_trace = (ompt_start_trace_t) lookup("ompt_start_trace");
ompt_flush_trace = (ompt_flush_trace_t) lookup("ompt_flush_trace");
ompt_stop_trace = (ompt_stop_trace_t) lookup("ompt_stop_trace");
ompt_get_record_ompt = (ompt_get_record_ompt_t) lookup("ompt_get_record_ompt");
ompt_advance_buffer_cursor = (ompt_advance_buffer_cursor_t) lookup("ompt_advance_buffer_cursor");
// DeviceMap must be initialized only once. Ensure this logic does not
// depend on external data structures because this init function may be
// called before main.
static bool IsDeviceMapInitialized = false;
if (!IsDeviceMapInitialized) {
DeviceMapPtr = std::make_unique<DeviceMap_t>();
IsDeviceMapInitialized = true;
}
set_trace_ompt();
// In many scenarios, this will be a good place to start the
// trace. If start_trace is called from the main program before this
// callback is dispatched, the start_trace handle will be null. This
// is because this device_init callback is invoked during the first
// target construct implementation.
start_trace(device_num, device);
}
static void on_ompt_callback_device_load
(
int device_num,
const char *filename,
int64_t offset_in_file,
void *vma_in_file,
size_t bytes,
void *host_addr,
void *device_addr,
uint64_t module_id
) {
printf("Load: device_num:%d filename:%s host_adddr:%p device_addr:%p bytes:%lu\n",
device_num, filename, host_addr, device_addr, bytes);
}
static void on_ompt_callback_target_data_op
(
ompt_id_t target_id,
ompt_id_t host_op_id,
ompt_target_data_op_t optype,
void *src_addr,
int src_device_num,
void *dest_addr,
int dest_device_num,
size_t bytes,
const void *codeptr_ra
) {
printf(" Callback DataOp: host_op_id=%lu optype=%d src=%p src_device_num=%d "
"dest=%p dest_device_num=%d bytes=%lu code=%p\n",
host_op_id, optype, src_addr, src_device_num,
dest_addr, dest_device_num, bytes, codeptr_ra);
}
static void on_ompt_callback_target_data_op_emi
(
ompt_scope_endpoint_t endpoint,
ompt_data_t *target_task_data,
ompt_data_t *target_data,
ompt_id_t *host_op_id,
ompt_target_data_op_t optype,
void *src_addr,
int src_device_num,
void *dest_addr,
int dest_device_num,
size_t bytes,
const void *codeptr_ra
) {
if (endpoint == ompt_scope_begin) *host_op_id = next_op_id++;
// target_task_data may be null, avoid dereferencing it
uint64_t target_task_data_value = (target_task_data) ? target_task_data->value : 0;
printf(" Callback DataOp EMI: endpoint=%d target_task_data=%p (0x%lx) target_data=%p (0x%lx) host_op_id=%p (0x%lx) optype=%d src=%p src_device_num=%d "
"dest=%p dest_device_num=%d bytes=%lu code=%p\n",
endpoint,
target_task_data, target_task_data_value,
target_data, target_data->value,
host_op_id, *host_op_id,
optype, src_addr, src_device_num,
dest_addr, dest_device_num, bytes, codeptr_ra);
}
static void on_ompt_callback_target
(
ompt_target_t kind,
ompt_scope_endpoint_t endpoint,
int device_num,
ompt_data_t *task_data,
ompt_id_t target_id,
const void *codeptr_ra
) {
printf("Callback Target: kind=%d endpoint=%d device_num=%d target_id=%lu code=%p\n",
kind, endpoint, device_num, target_id, codeptr_ra);
}
static void on_ompt_callback_target_emi
(
ompt_target_t kind,
ompt_scope_endpoint_t endpoint,
int device_num,
ompt_data_t *task_data,
ompt_data_t *target_task_data,
ompt_data_t *target_data,
const void *codeptr_ra
) {
if (endpoint == ompt_scope_begin) target_data->value = next_op_id++;
// target_task_data may be null, avoid dereferencing it
uint64_t target_task_data_value = (target_task_data) ? target_task_data->value : 0;
printf("Callback Target EMI: kind=%d endpoint=%d device_num=%d task_data=%p (0x%lx) target_task_data=%p (0x%lx) target_data=%p (0x%lx) code=%p\n",
kind, endpoint, device_num,
task_data, task_data->value,
target_task_data, target_task_data_value,
target_data, target_data->value,
codeptr_ra);
}
static void on_ompt_callback_target_submit
(
ompt_id_t target_id,
ompt_id_t host_op_id,
unsigned int requested_num_teams
) {
printf(" Callback Submit: target_id=%lu host_op_id=%lu req_num_teams=%d\n",
target_id, host_op_id, requested_num_teams);
}
static void on_ompt_callback_target_submit_emi
(
ompt_scope_endpoint_t endpoint,
ompt_data_t *target_data,
ompt_id_t *host_op_id,
unsigned int requested_num_teams
) {
printf(" Callback Submit EMI: endpoint=%d target_data=%p (0x%lx) host_op_id=%p (0x%lx) req_num_teams=%d\n",
endpoint,
target_data, target_data->value,
host_op_id, *host_op_id,
requested_num_teams);
}
// Init functions
int ompt_initialize(
ompt_function_lookup_t lookup,
int initial_device_num,
ompt_data_t *tool_data)
{
ompt_set_callback = (ompt_set_callback_t) lookup("ompt_set_callback");
ompt_set_callback(ompt_callback_device_initialize,
(ompt_callback_t)&on_ompt_callback_device_initialize);
ompt_set_callback(ompt_callback_device_load,
(ompt_callback_t)&on_ompt_callback_device_load);
#if EMI
ompt_set_callback(ompt_callback_target_submit_emi,
(ompt_callback_t)&on_ompt_callback_target_submit_emi);
ompt_set_callback(ompt_callback_target_data_op_emi,
(ompt_callback_t)&on_ompt_callback_target_data_op_emi);
ompt_set_callback(ompt_callback_target_emi,
(ompt_callback_t)&on_ompt_callback_target_emi);
#else
ompt_set_callback(ompt_callback_target_submit,
(ompt_callback_t)&on_ompt_callback_target_submit);
ompt_set_callback(ompt_callback_target_data_op,
(ompt_callback_t)&on_ompt_callback_target_data_op);
ompt_set_callback(ompt_callback_target,
(ompt_callback_t)&on_ompt_callback_target);
#endif
return 1; //success
}
void ompt_finalize(ompt_data_t *tool_data)
{
}
#ifdef __cplusplus
extern "C" {
#endif
ompt_start_tool_result_t *ompt_start_tool(
unsigned int omp_version,
const char *runtime_version)
{
static ompt_start_tool_result_t ompt_start_tool_result = {&ompt_initialize,&ompt_finalize, 0};
return &ompt_start_tool_result;
}
#ifdef __cplusplus
}
#endif
reproducer.cpp
Click to expand
#include <stdio.h>
#include <assert.h>
#include <omp.h>
#include "callbacks.h"
// Map of devices traced
DeviceMapPtr_t DeviceMapPtr;
int main()
{
int NumDevices = omp_get_num_devices();
assert(NumDevices > 0 && "No device(s) present.");
int Device = omp_get_default_device();
int Host = omp_get_initial_device();
// Note: Zero value depicts an OFFLOAD_SUCCESS
int Status;
printf("Allocating Memory on Device\n");
int *DevPtr = (int *)omp_target_alloc(sizeof(int), Device);
assert(DevPtr && "Could not allocate memory on device.");
int *HstPtr = (int *)malloc(sizeof(int));
*HstPtr = 42;
printf("Testing: Host to Device\n");
Status = omp_target_memcpy(DevPtr, HstPtr, sizeof(int), 0, 0, Device, Host);
assert(Status == 0 && "H2D memory copy operation failed.\n");
printf("Testing: Device to Device\n");
Status = omp_target_memcpy(DevPtr, DevPtr, sizeof(int), 0, 0, Device, Device);
assert(Status == 0 && "D2D memory copy operation failed.\n");
printf("Testing: Device to Host\n");
Status = omp_target_memcpy(HstPtr, DevPtr, sizeof(int), 0, 0, Host, Device);
assert(Status == 0 && "D2H memory copy operation failed.\n");
printf("Checking Correctness\n");
assert(*HstPtr == 42);
printf("Freeing Memory on Device\n");
free(HstPtr);
omp_target_free(DevPtr, Device);
for (auto Dev : *DeviceMapPtr)
flush_trace(Dev);
return 0;
}
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Used version of AOMP for testing (built on March 6th 2024):
$ amdclang --version
AOMP_STANDALONE_19.0-0 clang version 19.0.0_AOMP_STANDALONE_19.0-0 (https://github.com/RadeonOpenCompute/llvm-project f683f043bcdd432d88210a1d9b2859af065052b4)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/apps/software/aomp/trunk/bin
Additional Information
No response