recsys-examples
[Enhancement] Memory waste in segmented_unique
Problem
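The segmented_unique function allocates table_num separate buffers for each of tmp_unique_indices and tmp_accumulated_frequency_output, and every buffer has size num_total. Tables are processed serially, and each table uses only a small portion of its buffer, so about table_num × num_total elements are allocated per temporary array where num_total would suffice. This wastes significant GPU memory.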
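To put a number on the overhead, here is a minimal sketch that compares the two allocation strategies. The helper names and the element size are hypothetical and chosen only for illustration; the actual dtype of keys is not stated in this issue.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes allocated when each of table_num tables gets
// its own buffer of num_total elements (the current behaviour described above).
inline std::size_t per_table_buffer_bytes(std::size_t table_num,
                                          std::size_t num_total,
                                          std::size_t elem_size) {
    assert(elem_size > 0);
    return table_num * num_total * elem_size;
}

// Hypothetical helper: bytes allocated when a single buffer of num_total
// elements is shared by all tables.
inline std::size_t shared_buffer_bytes(std::size_t num_total,
                                       std::size_t elem_size) {
    assert(elem_size > 0);
    return num_total * elem_size;
}
```

For example, with 8 tables, 1,000,000 keys in total, and an assumed 8-byte element, the per-table layout allocates 64,000,000 bytes for one temporary array, while a shared buffer needs 8,000,000 bytes.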
Location
File: /corelib/dynamicemb/src/index_calculation.cu
std::vector<at::Tensor> tmp_unique_indices(table_num);
for (int i = 0; i < table_num; ++i) {
    tmp_unique_indices[i] = at::empty_like(keys); // Each buffer size: num_total
}
Todo
Use a single shared buffer of size num_total, or write directly into the final output buffer using slices at per-table offsets.
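The second option can be illustrated with a host-side sketch in plain C++. This is not the CUDA implementation: the function name, the result struct, and the table_offsets layout are assumptions, and sort-based deduplication stands in for whatever the kernel actually does, so the order of unique keys within a table may differ from the real output. The point is only that each table writes its results into one output buffer of size num_total at a running offset, so no per-table num_total buffer is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
struct SegmentedUniqueResult {
    std::vector<std::int64_t> unique_keys;    // one shared output buffer
    std::vector<std::int64_t> frequencies;    // parallel to unique_keys
    std::vector<std::size_t> unique_offsets;  // table t owns [unique_offsets[t], unique_offsets[t + 1])
};

// Assumption: table_offsets has table_num + 1 entries and table t owns
// keys[table_offsets[t] .. table_offsets[t + 1]).
SegmentedUniqueResult segmented_unique_shared(
        const std::vector<std::int64_t>& keys,
        const std::vector<std::size_t>& table_offsets) {
    assert(!table_offsets.empty());
    assert(table_offsets.front() == 0 && table_offsets.back() == keys.size());
    const std::size_t table_num = table_offsets.size() - 1;

    SegmentedUniqueResult out;
    out.unique_keys.resize(keys.size());   // size num_total, allocated once
    out.frequencies.resize(keys.size());
    out.unique_offsets.assign(table_num + 1, 0);

    std::vector<std::int64_t> scratch;     // reused for every table
    std::size_t write_pos = 0;             // running offset into the shared output
    for (std::size_t t = 0; t < table_num; ++t) {
        const auto first = keys.begin() + static_cast<std::ptrdiff_t>(table_offsets[t]);
        const auto last = keys.begin() + static_cast<std::ptrdiff_t>(table_offsets[t + 1]);
        scratch.assign(first, last);
        std::sort(scratch.begin(), scratch.end());

        // Collapse each run of equal keys into one entry plus its count.
        for (std::size_t i = 0; i < scratch.size();) {
            std::size_t j = i;
            while (j < scratch.size() && scratch[j] == scratch[i]) ++j;
            out.unique_keys[write_pos] = scratch[i];
            out.frequencies[write_pos] = static_cast<std::int64_t>(j - i);
            ++write_pos;
            i = j;
        }
        out.unique_offsets[t + 1] = write_pos;
    }

    // Trim the shared buffers to the number of unique keys actually written.
    out.unique_keys.resize(write_pos);
    out.frequencies.resize(write_pos);
    return out;
}
```

Because a table can never produce more unique keys than it has input keys, the running write offset never passes the start of the next table's input range, which is why a single num_total-sized output is always large enough.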