CUDA memory errors when running the PyTorch example
When running the following script: https://github.com/PAIR-code/lit/blob/main/lit_nlp/examples/simple_pytorch_demo.py I get CUDA out-of-memory errors, regardless of max_batch_size or the number of GPUs used. I have access to 10 GPUs with around 11 GB of VRAM each, so that should definitely be enough.
I am running the code exactly as it is in the repo, so I won't paste it here, but here is the error:
Traceback (most recent call last):
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/lib/wsgi_app.py", line 191, in __call__
return self._ServeCustomHandler(request, clean_path, environ)(
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/lib/wsgi_app.py", line 176, in _ServeCustomHandler
return self._handlers[clean_path](self, request, environ)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/app.py", line 385, in _handler
outputs = fn(data, **kw)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/app.py", line 305, in _get_interpretations
model_outputs = self._predict(data['inputs'], model, dataset_name)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/app.py", line 146, in _predict
return list(self._models[model_name].predict_with_metadata(
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/lib/caching.py", line 182, in predict_with_metadata
results = self._predict_with_metadata(*args, **kw)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/lib/caching.py", line 211, in _predict_with_metadata
model_preds = list(self.wrapped.predict_with_metadata(model_inputs))
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/api/model.py", line 197, in <genexpr>
results = (scrub_numpy_refs(res) for res in results)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/api/model.py", line 209, in _batched_predict
yield from self.predict_minibatch(minibatch, **kw)
File "/home/niallt/lit_nlp/lit_nlp/examples/simple_pytorch_demo.py", line 118, in predict_minibatch
self.model.cuda()
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 688, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 601, in _apply
param_applied = fn(param)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I have got this working fine with the standard lit-nlp demo, which I presume uses the TensorFlow backend by default, but my own models/codebases will require PyTorch.
Any thoughts on what may be causing this? I am not an expert on how lit-nlp processes the data behind the scenes, but it's occurring during predict_minibatch(), and I can confirm it doesn't get past moving the model, and then the batch, to CUDA.
For example, I added some debugging prints to check what was going on:
def predict_minibatch(self, inputs):
  # Preprocess to ids and masks, and make the input batch.
  encoded_input = self.tokenizer.batch_encode_plus(
      [ex["sentence"] for ex in inputs],
      return_tensors="pt",
      add_special_tokens=True,
      max_length=128,
      padding="longest",
      truncation="longest_first")
  print(f"encoded input is: {encoded_input}")
  # Check and send to cuda (GPU) if available
  if torch.cuda.is_available():
    print(f"cuda avaialble!")
    self.model.cuda()
    for tensor in encoded_input:
      print(f"tensor is: {tensor}")
      encoded_input[tensor] = encoded_input[tensor].cuda()
    print(f"encoded input after passing to cuda is: {encoded_input}")
  # Run a forward pass.
  with torch.no_grad():  # remove this if you need gradients.
    out: transformers.modeling_outputs.SequenceClassifierOutput = \
        self.model(**encoded_input)
  # Post-process outputs.
  batched_outputs = {
      "probas": torch.nn.functional.softmax(out.logits, dim=-1),
      "input_ids": encoded_input["input_ids"],
      "ntok": torch.sum(encoded_input["attention_mask"], dim=1),
      "cls_emb": out.hidden_states[-1][:, 0],  # last layer, first token
  }
  # Return as NumPy for further processing.
  detached_outputs = {k: v.cpu().numpy() for k, v in batched_outputs.items()}
  # Unbatch outputs so we get one record per input example.
  for output in utils.unbatch_preds(detached_outputs):
    ntok = output.pop("ntok")
    output["tokens"] = self.tokenizer.convert_ids_to_tokens(
        output.pop("input_ids")[1:ntok - 1])
    yield output
I0812 14:55:26.451673 140234135095104 caching.py:210] Prepared 872 inputs for model
encoded input is: {'input_ids': tensor([[ 101, 2009, 1005, 1055, 1037, 11951, 1998, 2411, 12473, 4990, 1012, 102], [ 101, 4895, 10258, 2378, 8450, 2135, 21657, 1998, 7143, 102, 0, 0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
cuda avaialble!
E0812 14:55:26.461915 140234135095104 wsgi_app.py:208] Uncaught error: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Any thoughts would be much appreciated. The GPU environment I have can ordinarily handle these models very easily.
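For context, a standalone check along these lines runs fine on the same machine; the model name below is just a placeholder for whatever checkpoint the demo actually loads:

import torch
import transformers

# Placeholder model: a small HF classifier, just to exercise .cuda() in isolation.
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "google/bert_uncased_L-2_H-128_A-2")
print(f"allocated before .cuda(): {torch.cuda.memory_allocated() / 1e6:.0f} MB")
model.cuda()  # this is the same step that fails inside predict_minibatch() under LIT
print(f"allocated after .cuda():  {torch.cuda.memory_allocated() / 1e6:.0f} MB")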
Thanks in advance!
This is odd, given that it's GPU memory I'm not sure it's from LIT necessarily - in particular, LIT doesn't know about CUDA or the GPU at all, and that's entirely handled through the model code. If you just instantiate the model class and call predict_minibatch() directly from Python, do you get the same error?
See https://github.com/PAIR-code/lit/blob/main/lit_nlp/examples/notebooks/LIT_Components_Example.ipynb for an example notebook that shows how to use LIT components without starting a server. Another thing you could try is running the server with --warm_start=1, which will run inference on start-up in a single thread and can make things easier to debug.
In terms of how the data is handled: predict_minibatch() gets called in a loop here: https://github.com/PAIR-code/lit/blob/main/lit_nlp/api/model.py#L200, the model gets wrapped in CachingModelWrapper: https://github.com/PAIR-code/lit/blob/main/lit_nlp/lib/caching.py#L100, and then predict() gets called in a couple of places in app.py: https://github.com/PAIR-code/lit/blob/main/lit_nlp/app.py. If you test in a notebook, though, you should be able to skip all of that and call your predict_minibatch() directly.
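Roughly, the notebook version of that looks like the sketch below; the class name and checkpoint path are assumptions based on simple_pytorch_demo.py, so adjust them to your setup:

from lit_nlp.examples.datasets import glue
from lit_nlp.examples import simple_pytorch_demo

# Build the dataset and model directly, with no LIT server or caching wrapper involved.
datasets = {"sst_dev": glue.SST2Data("validation")}
models = {"sst_tiny": simple_pytorch_demo.SimpleSentimentModel("/path/to/sst2_tiny")}

# Call predict_minibatch() on a small slice to take batching and caching out of the picture.
preds = list(models["sst_tiny"].predict_minibatch(datasets["sst_dev"].examples[:8]))
print(len(preds), list(preds[0].keys()))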
Hi, thanks for the quick reply.
It does feel odd, and I can confirm that running outside of the server, in plain Python / a Jupyter notebook, still runs into CUDA out of memory.
Using the below in a notebook as an example:
list(models['sst_tiny'].predict_minibatch(datasets['sst_dev'].examples))
Annoyingly, I just tested on a Google Colab instance and it worked fine... although the Colab instance has 15 GB of VRAM vs my ~11 GB. It all seems to happen when model.cuda() is called under the hood of the predict_minibatch() function, and the GPU memory usage inflates to around 11 GB, vs the usual 1069 MiB.
So I can now confirm it is not actually the model.cuda() call that is the issue; it's that the dataset has, I guess, been preloaded onto the CUDA device or something?
If I call model.cuda() BEFORE creating the dataset, the model's GPU usage is normal. So I guess whatever is happening to the dataset is the problem here. But as I mentioned before, batch size changes nothing, and the CUDA memory issues are being caused by the dataset creation/loading via:
datasets = {'sst_dev': glue.SST2Data('validation')}
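For reference, this is roughly how I've been watching memory around the dataset creation (just a sketch; the glue import path is whatever my installed lit-nlp exposes, and torch.cuda.memory_allocated() only sees PyTorch's own allocations, so I keep nvidia-smi open alongside it):

import torch
from lit_nlp.examples.datasets import glue

# Note: memory_allocated() only tracks tensors allocated by PyTorch on the current
# device; allocations made by other libraries only show up in nvidia-smi.
print(f"allocated before dataset creation: {torch.cuda.memory_allocated() / 1e6:.0f} MB")
datasets = {'sst_dev': glue.SST2Data('validation')}
print(f"allocated after dataset creation:  {torch.cuda.memory_allocated() / 1e6:.0f} MB")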
Any further thoughts?
Thanks