
LayoutLMv3 document layout analysis training error

yuumiin opened this issue 3 years ago

Model I am using: LayoutLMv3

After running object_detection/train_net.py, I get a CUDA error:
RuntimeError: CUDA out of memory

After adjusting IMS_PER_BATCH in cascade_layoutlmv3.yaml, a different error occurred: TypeError: save_for_backward can only save variables, but argument 6 is of type bool

How can I solve this?

yuumiin avatar May 30 '22 07:05 yuumiin

It is hard to locate the cause and debug without the full error stack traces. Have you set IMS_PER_BATCH to a multiple of the number of GPUs you are using? For example, if you are using 8 GPUs, you could set IMS_PER_BATCH to 8, 16, etc.
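
If you want to sanity-check the value before launching, here is a minimal sketch (not the repo's launch code; it assumes the LayoutLMv3 config reuses the standard detectron2 SOLVER.IMS_PER_BATCH key):

from detectron2.config import get_cfg

num_gpus = 8                              # however many GPUs you launch with
cfg = get_cfg()                           # default detectron2 config node
cfg.SOLVER.IMS_PER_BATCH = 1 * num_gpus   # 1 image per GPU; try 2 * num_gpus if memory allows
assert cfg.SOLVER.IMS_PER_BATCH % num_gpus == 0, \
    "IMS_PER_BATCH is the total batch size across GPUs, keep it a multiple of the GPU count"

When launching through train_net.py, the same key can usually be overridden as a trailing KEY VALUE pair on the command line, which is the standard detectron2 convention.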

HYPJUDY avatar Jun 01 '22 03:06 HYPJUDY

@HYPJUDY [screenshot attached] I also encountered this problem. How can I solve it?

wyc2015fq avatar Jul 22 '22 10:07 wyc2015fq

Same here, interesting. I am using PyTorch 1.8.0 and I suspect this version is the problem. The error happens inside PyTorch's gradient checkpointing function. In PyTorch 1.8.0, CheckpointFunction saves every input for the backward pass via save_for_backward, which only accepts torch.Tensors. That breaks when the checkpointed function (layer_module.forward in this case) takes non-tensor inputs such as bools. Below is the source code of CheckpointFunction in PyTorch 1.8.0:

class CheckpointFunction(torch.autograd.Function):

    @staticmethod
    def forward(ctx, run_function, preserve_rng_state, *args):
        check_backward_validity(args)
        ctx.run_function = run_function
        ctx.preserve_rng_state = preserve_rng_state
        ctx.had_autocast_in_fwd = torch.is_autocast_enabled()
        if preserve_rng_state:
            ctx.fwd_cpu_state = torch.get_rng_state()
            # Don't eagerly initialize the cuda context by accident.
            # (If the user intends that the context is initialized later, within their
            # run_function, we SHOULD actually stash the cuda state here.  Unfortunately,
            # we have no way to anticipate this will happen before we run the function.)
            ctx.had_cuda_in_fwd = False
            if torch.cuda._initialized:
                ctx.had_cuda_in_fwd = True
                ctx.fwd_gpu_devices, ctx.fwd_gpu_states = get_device_states(*args)
        ctx.save_for_backward(*args)
        with torch.no_grad():
            outputs = run_function(*args)
        return outputs
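
To make this concrete, here is a minimal repro sketch (the layer function and shapes are made up): under PyTorch 1.8.0 the bool argument reaches ctx.save_for_backward(*args) and triggers the same TypeError, while under 1.10 or later the same snippet runs fine.

import torch
from torch.utils.checkpoint import checkpoint

def layer_forward(hidden_states, output_attentions):
    # stands in for layer_module.forward, which takes a non-tensor flag
    return hidden_states * 2

hidden_states = torch.randn(2, 4, requires_grad=True)
out = checkpoint(layer_forward, hidden_states, False)  # the bool is what breaks on 1.8.0
out.sum().backward()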

However, this issue is fixed in PyTorch 1.10.0, as also explained in this blog. The relevant difference is around lines 27-41 of torch/utils/checkpoint.py: the newer forward separates tensor arguments from non-tensor arguments and passes only the tensors to save_for_backward.
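
Roughly, the newer forward does something like this (a simplified illustration, not the verbatim 1.10 source):

import torch

def stash_checkpoint_inputs(ctx, *args):
    # Only tensor inputs are handed to save_for_backward; bools, ints, None, ...
    # stay on ctx as plain Python objects and are put back in place in backward().
    ctx.inputs = []
    ctx.tensor_indices = []
    tensor_inputs = []
    for i, arg in enumerate(args):
        if torch.is_tensor(arg):
            tensor_inputs.append(arg)
            ctx.tensor_indices.append(i)
            ctx.inputs.append(None)   # placeholder, filled from ctx.saved_tensors later
        else:
            ctx.inputs.append(arg)
    ctx.save_for_backward(*tensor_inputs)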

So updating your PyTorch version should fix the issue.

haolibai avatar Sep 05 '22 14:09 haolibai

If you are hitting the CUDA out-of-memory error, limiting the allocator's maximum split size can help with fragmentation:

Windows: set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

Linux: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
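
The same setting can also be applied from inside the training script; a small sketch (it only takes effect if set before the CUDA caching allocator is initialized):

import os

# Set before the first CUDA tensor is created, otherwise it has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after setting the allocator config, to be safe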

mllife avatar Jul 24 '23 08:07 mllife