question about torch 2.1.0 integration
Thanks for sharing! I greatly appreciate your work on reducing CUDA memory fragmentation. I recently integrated GMLake into torch 2.1.0 and it compiled without errors. I would like to know how to confirm that GMLake is working properly, since I did not see any reduction in peak reserved memory when using LoRA to train Llama2-7B.
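For reference, this is roughly how I compare peak memory on my side. It is only a sketch using the standard torch.cuda counters, nothing GMLake-specific, so it may not fully reflect the fused virtual blocks:

```python
import torch

# Reset the peak counters right before the iterations being compared.
torch.cuda.reset_peak_memory_stats()

# ... run a few LoRA training iterations here ...

stats = torch.cuda.memory_stats()
print(f"peak reserved : {torch.cuda.max_memory_reserved() / 2**20:.1f} MiB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
# Reserved-but-unallocated memory at the peak is a rough proxy for fragmentation.
print(f"reserved - allocated at peak: "
      f"{(torch.cuda.max_memory_reserved() - torch.cuda.max_memory_allocated()) / 2**20:.1f} MiB")
print(f"alloc retries: {stats.get('num_alloc_retries', 0)}")
```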
Also, the garbage_collect_fused_blocks() function jumps to its error-handling section; could that be preventing GMLake from working?
Here are some logs from a run with only 6 training iterations.
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x25f46060, ptr 0x12a0000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x12a0000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 20.435480ms, total_fuse_size 32558.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2a994d70, ptr 0x12c0000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x12c0000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 40.207251ms, total_fuse_size 33512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2ba9e650, ptr 0x1320000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x1320000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 20.692452ms, total_fuse_size 34024.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2bab4010, ptr 0x1340000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x1340000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 51.173343ms, total_fuse_size 34978.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2b83a6b0, ptr 0x13a0000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x13a0000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 30.265250ms, total_fuse_size 35490.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x26fa6af0, ptr 0x13c0000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x13c0000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 49.731019ms, total_fuse_size 36444.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x28e575f0, ptr 0x13fc000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x13fc000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 40.066690ms, total_fuse_size 37398.000000MB
{'train_runtime': 25.9383, 'train_samples_per_second': 1.851, 'train_steps_per_second': 0.231, 'train_loss': 1.7313324610392253, 'epoch': 1.0}
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
Hi @Pegessi, I am currently working on the master branch of the repository and hit a compilation error when building against PyTorch 2.1. Which branch do you use?
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3499:27: error: ‘struct c10::cuda::CUDACachingAllocator::Native::{anonymous}::HistoryChain’ has no member named ‘h’
 3499 |   block->history->h.context);
      |                   ^
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp: In member function ‘void c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::record_trace(c10::cuda::CUDACachingAllocator::TraceEntry::Action, int64_t, size_t, cudaStream_t, int)’:
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3669:37: error: operands to ‘?:’ have different types ‘std::remove_reference<int&>::type’ {aka ‘int’} and ‘std::nullptr_t’
 3669 |   alloc_trace_record_context_ ? std::move(context) : nullptr);
      |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp: At global scope:
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3802:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::recordHistory(bool, c10::cuda::CUDACachingAllocator::CreateContextFn, size_t, bool)’ marked ‘override’, but does not override
 3802 |   void recordHistory(
      |        ^~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3925:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureBegin(int, c10::cuda::CaptureId_t, c10::cuda::MempoolId_t)’ marked ‘override’, but does not override
 3925 |   void notifyCaptureBegin(
      |        ^~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3934:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureAboutToEnd(int, c10::cuda::CaptureId_t)’ marked ‘override’, but does not override
 3934 |   void notifyCaptureAboutToEnd(int device, CaptureId_t graph_id) override {
      |        ^~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3939:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureEnded(int, c10::cuda::CaptureId_t)’ marked ‘override’, but does not override
 3939 |   void notifyCaptureEnded(int device, CaptureId_t graph_id) override {} // no-op
      |        ^~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3941:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureDestroy(int, c10::cuda::MempoolId_t)’ marked ‘override’, but does not override
 3941 |   void notifyCaptureDestroy(int device, MempoolId_t mempool_id) override {
      |        ^~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3967:8: error: ‘bool c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::needsPoolSpecificPeerAccess()’ marked ‘override’, but does not override
 3967 |   bool needsPoolSpecificPeerAccess() override {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:4031:24: error: cannot declare variable ‘c10::cuda::CUDACachingAllocator::Native::allocator’ to be of abstract type ‘c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator’
 4031 | NativeCachingAllocator allocator;
      |                        ^~~~~~~~~
I integrated GMLake into torch 2.1.0 manually. The code from this repository cannot be dropped into PyTorch 2.1.0 directly because of interface changes in CUDACachingAllocator.h/.cpp. Although my manual port builds successfully and the logs show that the virtual memory blocks are being created, I am still not sure whether GMLake is actually working: my version does not reduce peak memory during DNN training, and the overhead is large for the first few iterations.
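To sanity-check the port, I also run a small synthetic fragmentation test. This is only my own sketch (the block sizes and the expected behaviour of the stock allocator are assumptions), but the idea is that the final large request can only be served without growing the reserved pool if the allocator can stitch the scattered free segments together, which is what the get_fused_fragmented_blocks logs suggest GMLake should do:

```python
import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

# Allocate 64 x 32 MiB blocks; each one is large enough to live in its own segment.
hold = [torch.empty(32 * 2**20, dtype=torch.uint8, device="cuda") for _ in range(64)]

# Free every other block, leaving ~1 GiB of free memory scattered across
# non-contiguous segments.
del hold[::2]
torch.cuda.synchronize()
before = torch.cuda.memory_reserved()

# A 1 GiB request: the stock caching allocator cannot serve this from the
# scattered 32 MiB holes, so it typically grows the reserved pool; if the
# stitching works, it should be satisfied by fusing the freed blocks instead.
big = torch.empty(1024 * 2**20, dtype=torch.uint8, device="cuda")

after = torch.cuda.memory_reserved()
print(f"reserved before: {before / 2**20:.0f} MiB, after: {after / 2**20:.0f} MiB")
```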
@Pegessi Have you ever encountered the following problem? When I patched GMLake into torch 2.1.0, I found that sometimes when release_block calls cudaFree on a small block, CUDA raises an illegal memory access:
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7f3e8791e1f2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: ...
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
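For what it's worth, this is the kind of minimal repro I use to try to hit the release path deterministically. It is my own debugging sketch with made-up sizes; CUDA_LAUNCH_BLOCKING only helps asynchronous kernel errors surface closer to the real call site, it does not change the allocator itself:

```python
import os
# Must be set before the first CUDA call in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Create and free a handful of small (<1 MiB) blocks so the cache has
# something to release from the small-block pool.
x = [torch.empty(512 * 1024, dtype=torch.uint8, device="cuda") for _ in range(16)]
del x
torch.cuda.synchronize()

# empty_cache() forces the allocator down the block-release path, where
# cudaFree (and, in the GMLake port, release_block) is called.
torch.cuda.empty_cache()
torch.cuda.synchronize()
print("empty_cache completed without an illegal memory access")
```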