Alfred Xu
/kind feature **Describe the solution you'd like** Currently, Katib has three kinds of success reasons: 1. ExperimentGoalReached 2. ExperimentMaxTrialsReached 3. ExperimentSuggestionEndReached In my opinion, it will bring a lot convenient...
Fix #15705 1. Replacing `Cuda.memset(Long.MAX_VALUE, (byte) 0, 1024)` with `Cuda.freePinned(-1L)`: the former throws the fatal CUDA error `cudaErrorIllegalAddress` instead of the non-fatal CUDA error `cudaErrorInvalidValue` on aarch64, while the latter throws the...
**Describe the bug** For the test case `CudaTest.testCudaException`:

```java
assertThrows(CudaException.class, () -> {
  try {
    Cuda.memset(Long.MAX_VALUE, (byte) 0, 1024);
  } catch (CudaFatalException ignored) {
  } catch (CudaException ex) {
    assertEquals(CudaException.CudaError.cudaErrorInvalidValue,...
```
Closes #13969 ### Overview This PR tightly couples the virtual memory budget with the lifecycle of the actual memory buffer `HostMemoryBuffer` used in the runner, by making `MemoryBoundedAsyncRunner` serve as...
**Is your feature request related to a problem? Please describe.** The current `ResourceBoundedExecutor` manages asynchronous scanning using a "virtual" budget (Triple Buffer MemManagement) that is loosely coupled with the actual...
**Is your feature request related to a problem? Please describe.** Based on insights from #13884, we observed severe OOM retries and semaphore waits during batch concatenation. The current implementation of...
**Describe the bug** When running a heavy GpuAggregate consisting of over 400 aggregate functions (including hundreds of instances of the comprehensive function `stddev_pop`), a significant amount of spill is observed in the map...