
What's the difference between `active_bytes` and `reserved_bytes`?


I need to show that gradient checkpointing can really save GPU memory during backward propagation. In the result there are two columns on the left, active_bytes and reserved_bytes. In my test the active bytes read 3.83G while the reserved bytes read 9.35G. So why does PyTorch still reserve that much GPU memory?

nyngwang, Aug 27 '22

PyTorch caches CUDA memory to avoid the cost of repeated allocations; you can find more information here:

https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management

In your case, the reserved bytes should be the peak memory usage before checkpointing, while the active bytes should be the current memory usage after checkpointing.
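For illustration, here is a minimal sketch (assuming a CUDA device is available) of how the two numbers diverge: freeing a tensor drops the allocated bytes back to roughly zero, while the caching allocator keeps the block reserved for reuse.

```python
import torch

x = torch.randn(256, 1024, 1024, device='cuda')     # ~1 GiB of float32
print('allocated:', torch.cuda.memory_allocated())  # bytes held by live tensors
print('reserved: ', torch.cuda.memory_reserved())   # bytes the allocator took from the driver

del x                                                # the tensor is freed ...
print('allocated:', torch.cuda.memory_allocated())  # ... so allocated drops to ~0
print('reserved: ', torch.cuda.memory_reserved())   # ... but the block stays cached for reuse
```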

Stonesjtu, Aug 29 '22

## VGG.forward

active_bytes reserved_bytes line code
         all            all
        peak           peak
       5.71G         10.80G   50     @profile
                              51     def forward(self, x):
       3.86G          8.77G   52         out = self.features(x)
       2.19G          8.77G   53         out = self.classifier(out)
       2.19G          8.77G   54         return out

@Stonesjtu Could you help me re-check the code above? I checkpointed self.features internally (it is itself an nn.Module with an nn.Sequential inside), but added the @profile decorator on the forward method of the outer class that uses the features (conv2d layers), as shown above.
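For reference, here is a hypothetical sketch of the setup I mean (the class name, layer sizes and num_segments are made up, not my actual model): self.features is split into checkpointed segments with torch.utils.checkpoint.checkpoint_sequential, while @profile wraps only the outer forward.

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential
from pytorch_memlab import profile

class Net(nn.Module):
    def __init__(self, num_segments=2):
        super().__init__()
        # stand-in for the real feature extractor (conv2d layers)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 10)
        self.num_segments = num_segments  # how self.features is partitioned

    @profile
    def forward(self, x):
        # only the segment boundaries of self.features are kept during forward;
        # everything inside a segment is recomputed during backward
        out = checkpoint_sequential(self.features, self.num_segments, x)
        return self.classifier(out.flatten(1))
```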

Q1: How do you explain this: if I keep the same batch size but change how self.features is partitioned internally into checkpointed segments, the active_bytes of the next, non-checkpointed line self.classifier(out) also changes.

I also have two additional lines printed before the stats above:

Max CUDA memory allocated on forward:  1.22G
Max CUDA memory allocated on backward:  5.71G

which are generated by the code appended below.

Q2: How should I interpret the reserved_bytes values, i.e. 10.80G and 8.77G, in the stats generated by pytorch_memlab above? Does it mean that PyTorch internally allocates much more GPU memory than it really needs?

# compute output
if i < 1:
    torch.cuda.reset_peak_memory_stats()
output = model(images)
loss = criterion(output, target)
if i < 1:
    print('Max CUDA memory allocated on forward: ', utils.readable_size(torch.cuda.max_memory_allocated()))

# measure accuracy and record loss
acc1, acc5 = accuracy(output, target, topk=(1, 5))
losses.update(loss.detach().item(), images.size(0))
top1.update(acc1[0], images.size(0))
top5.update(acc5[0], images.size(0))

# compute gradient and do SGD step
if i < 1:
    torch.cuda.reset_peak_memory_stats()
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i < 1:
    print('Max CUDA memory allocated on backward: ', utils.readable_size(torch.cuda.max_memory_allocated()))

nyngwang, Aug 30 '22

Q1: How do you explain this: if I keep the same batch size but change how self.features is partitioned internally into checkpointed segments, the active_bytes of the next, non-checkpointed line self.classifier(out) also changes.

The column (or metric) active_bytes / all / peak is actually the peak active bytes during the execution of that line. It is a cumulative value that depends on how many bytes were already active before the line executed.

E.g. if you have 4 Linear layers in an nn.Sequential, checkpointing after a later layer would consume fewer active bytes than checkpointing after layers[0].
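A rough way to see the effect (just a sketch with made-up layer sizes, assuming a CUDA device): partitioning the same four Linear layers into a different number of checkpointed segments changes which intermediate activations stay alive, so the peak allocated bytes differ even though the batch size never changes.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

def peak_bytes(num_segments):
    layers = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)]).cuda()
    x = torch.randn(8192, 4096, device='cuda', requires_grad=True)
    torch.cuda.reset_peak_memory_stats()
    out = checkpoint_sequential(layers, num_segments, x)
    out.sum().backward()
    return torch.cuda.max_memory_allocated()

# Different partitions keep different sets of activations alive,
# so the peak changes although the batch size is identical.
for num_segments in (1, 2, 4):
    print(num_segments, 'segment(s) ->', peak_bytes(num_segments) / 2**20, 'MiB')
```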


Q2: How should I interpret the reserved_bytes values, i.e. 10.80G and 8.77G, in the stats generated by pytorch_memlab above? Does it mean that PyTorch internally allocates much more GPU memory than it really needs?

According to the PyTorch documentation:

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi.

PyTorch actually did need that cached memory at some point of the execution, but at the time you read torch.cuda.max_memory_allocated it no longer needs that much space. You can try torch.cuda.empty_cache() before reading the memory stats to release the cached (reserved but unused) blocks.
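The active_bytes / reserved_bytes columns in the report come from the same counters that torch.cuda.memory_stats() exposes, so you can also inspect them directly around your code (a quick sketch, assuming a CUDA device):

```python
import torch

torch.cuda.reset_peak_memory_stats()
x = torch.randn(256, 1024, 1024, device='cuda')  # ~1 GiB of transient workspace
del x
torch.cuda.empty_cache()                          # release the cached block before reading stats

stats = torch.cuda.memory_stats()
# peak bytes actually held by tensors during the run
print('active_bytes.all.peak     :', stats['active_bytes.all.peak'])
# peak bytes taken from the driver, including the allocator's cache
print('reserved_bytes.all.peak   :', stats['reserved_bytes.all.peak'])
# reserved bytes right now, after empty_cache() has returned the cache
print('reserved_bytes.all.current:', stats['reserved_bytes.all.current'])
```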

Stonesjtu, Aug 30 '22