question about torch 2.1.0 integration
Thanks for sharing! I greatly appreciate your work on reducing CUDA memory fragmentation. I recently integrated GMLake into torch 2.1.0 and it compiled without errors. I would like to know how to confirm that GMLake is working properly, since I did not see any reduction in peak reserved memory when using LoRA to train Llama2-7B.
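For reference, this is roughly how I compare peak memory on my side. It is only a sketch using the standard torch.cuda counters, nothing GMLake-specific, so it may not fully reflect the fused virtual blocks:

```python
import torch

# Reset the peak counters right before the iterations being compared.
torch.cuda.reset_peak_memory_stats()

# ... run a few LoRA training iterations here ...

stats = torch.cuda.memory_stats()
print(f"peak reserved : {torch.cuda.max_memory_reserved() / 2**20:.1f} MiB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
# Reserved-but-unallocated memory at the peak is a rough proxy for fragmentation.
print(f"reserved - allocated at peak: "
      f"{(torch.cuda.max_memory_reserved() - torch.cuda.max_memory_allocated()) / 2**20:.1f} MiB")
print(f"alloc retries: {stats.get('num_alloc_retries', 0)}")
```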
Also, the garbage_collect_fused_blocks() function jumps to its error-handling section; could that be preventing GMLake from working?
Here are some logs from a run with only 6 training iterations.
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x25f46060, ptr 0x12a0000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x12a0000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 20.435480ms, total_fuse_size 32558.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2a994d70, ptr 0x12c0000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x12c0000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 40.207251ms, total_fuse_size 33512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2ba9e650, ptr 0x1320000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x1320000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 20.692452ms, total_fuse_size 34024.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2bab4010, ptr 0x1340000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x1340000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 51.173343ms, total_fuse_size 34978.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2b83a6b0, ptr 0x13a0000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x13a0000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 30.265250ms, total_fuse_size 35490.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x26fa6af0, ptr 0x13c0000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x13c0000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 49.731019ms, total_fuse_size 36444.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x28e575f0, ptr 0x13fc000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x13fc000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 40.066690ms, total_fuse_size 37398.000000MB
{'train_runtime': 25.9383, 'train_samples_per_second': 1.851, 'train_steps_per_second': 0.231, 'train_loss': 1.7313324610392253, 'epoch': 1.0}
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
Hi @Pegessi, I am currently working on the master branch of the repository and hit a compilation error when building against PyTorch 2.1. Which branch do you use?
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3499:27: error: ‘struct c10::cuda::CUDACachingAllocator::Native::{anonymous}::HistoryChain’ has no member named ‘h’
 3499 |   block->history->h.context);
      |                   ^
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp: In member function ‘void c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::record_trace(c10::cuda::CUDACachingAllocator::TraceEntry::Action, int64_t, size_t, cudaStream_t, int)’:
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3669:37: error: operands to ‘?:’ have different types ‘std::remove_reference<int&>::type’ {aka ‘int’} and ‘std::nullptr_t’
 3669 |   alloc_trace_record_context_ ? std::move(context) : nullptr);
      |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp: At global scope:
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3802:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::recordHistory(bool, c10::cuda::CUDACachingAllocator::CreateContextFn, size_t, bool)’ marked ‘override’, but does not override
 3802 |   void recordHistory(
      |        ^~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3925:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureBegin(int, c10::cuda::CaptureId_t, c10::cuda::MempoolId_t)’ marked ‘override’, but does not override
 3925 |   void notifyCaptureBegin(
      |        ^~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3934:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureAboutToEnd(int, c10::cuda::CaptureId_t)’ marked ‘override’, but does not override
 3934 |   void notifyCaptureAboutToEnd(int device, CaptureId_t graph_id) override {
      |        ^~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3939:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureEnded(int, c10::cuda::CaptureId_t)’ marked ‘override’, but does not override
 3939 |   void notifyCaptureEnded(int device, CaptureId_t graph_id) override {} // no-op
      |        ^~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3941:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureDestroy(int, c10::cuda::MempoolId_t)’ marked ‘override’, but does not override
 3941 |   void notifyCaptureDestroy(int device, MempoolId_t mempool_id) override {
      |        ^~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3967:8: error: ‘bool c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::needsPoolSpecificPeerAccess()’ marked ‘override’, but does not override
 3967 |   bool needsPoolSpecificPeerAccess() override {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:4031:24: error: cannot declare variable ‘c10::cuda::CUDACachingAllocator::Native::allocator’ to be of abstract type ‘c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator’
 4031 | NativeCachingAllocator allocator;
      |                        ^~~~~~~~~
I integrated GMLake into torch 2.1.0 manually. The code from this repository cannot be dropped into PyTorch 2.1.0 directly because of interface changes in CUDACachingAllocator.h/.cpp. Although my manual port builds successfully and the logs show that the virtual memory blocks are being created, I am still not sure whether GMLake is actually working: my version does not reduce peak memory during DNN training, and the overhead is large for the first few iterations.
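To sanity-check the port, I also run a small synthetic fragmentation test. This is only my own sketch (the block sizes and the expected behaviour of the stock allocator are assumptions), but the idea is that the final large request can only be served without growing the reserved pool if the allocator can stitch the scattered free segments together, which is what the get_fused_fragmented_blocks logs suggest GMLake should do:

```python
import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

# Allocate 64 x 32 MiB blocks; each one is large enough to live in its own segment.
hold = [torch.empty(32 * 2**20, dtype=torch.uint8, device="cuda") for _ in range(64)]

# Free every other block, leaving ~1 GiB of free memory scattered across
# non-contiguous segments.
del hold[::2]
torch.cuda.synchronize()
before = torch.cuda.memory_reserved()

# A 1 GiB request: the stock caching allocator cannot serve this from the
# scattered 32 MiB holes, so it typically grows the reserved pool; if the
# stitching works, it should be satisfied by fusing the freed blocks instead.
big = torch.empty(1024 * 2**20, dtype=torch.uint8, device="cuda")

after = torch.cuda.memory_reserved()
print(f"reserved before: {before / 2**20:.0f} MiB, after: {after / 2**20:.0f} MiB")
```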
@Pegessi Have you ever encountered the following problem? When I patched GMLake into torch 2.1.0, I found that sometimes when release_block calls cudaFree on a small block, CUDA raises an illegal memory access:
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7f3e8791e1f2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: ...
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
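For what it's worth, this is the kind of minimal repro I use to try to hit the release path deterministically. It is my own debugging sketch with made-up sizes; CUDA_LAUNCH_BLOCKING only helps asynchronous kernel errors surface closer to the real call site, it does not change the allocator itself:

```python
import os
# Must be set before the first CUDA call in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Create and free a handful of small (<1 MiB) blocks so the cache has
# something to release from the small-block pool.
x = [torch.empty(512 * 1024, dtype=torch.uint8, device="cuda") for _ in range(16)]
del x
torch.cuda.synchronize()

# empty_cache() forces the allocator down the block-release path, where
# cudaFree (and, in the GMLake port, release_block) is called.
torch.cuda.empty_cache()
torch.cuda.synchronize()
print("empty_cache completed without an illegal memory access")
```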