CUDA memory errors when running the PyTorch example
When running the following script: https://github.com/PAIR-code/lit/blob/main/lit_nlp/examples/simple_pytorch_demo.py I get CUDA out-of-memory errors, regardless of max_batch_size or the number of GPUs used. I have access to 10 GPUs with around 11 GB of VRAM each, so that should definitely be enough.
I am running the code exactly as it is in the repo, so I won't paste it here, but here is the error:
Traceback (most recent call last):
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/lib/wsgi_app.py", line 191, in __call__
return self._ServeCustomHandler(request, clean_path, environ)(
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/lib/wsgi_app.py", line 176, in _ServeCustomHandler
return self._handlers[clean_path](self, request, environ)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/app.py", line 385, in _handler
outputs = fn(data, **kw)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/app.py", line 305, in _get_interpretations
model_outputs = self._predict(data['inputs'], model, dataset_name)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/app.py", line 146, in _predict
return list(self._models[model_name].predict_with_metadata(
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/lib/caching.py", line 182, in predict_with_metadata
results = self._predict_with_metadata(*args, **kw)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/lib/caching.py", line 211, in _predict_with_metadata
model_preds = list(self.wrapped.predict_with_metadata(model_inputs))
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/api/model.py", line 197, in <genexpr>
results = (scrub_numpy_refs(res) for res in results)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/lit_nlp/api/model.py", line 209, in _batched_predict
yield from self.predict_minibatch(minibatch, **kw)
File "/home/niallt/lit_nlp/lit_nlp/examples/simple_pytorch_demo.py", line 118, in predict_minibatch
self.model.cuda()
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 688, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 601, in _apply
param_applied = fn(param)
File "/home/niallt/venvs/39nlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I have got this working fine with the standard lit-nlp demo, which I presume uses the TensorFlow backend by default, but my own models/codebases will require PyTorch.
Any thoughts on what may be causing this? I am not an expert on how lit-nlp processes the data behind the scenes, but it's occurring during predict_minibatch(), and I can confirm it doesn't get past moving the model, and then the batch, to CUDA.
For example, I added some debugging prints to check what was going on:
def predict_minibatch(self, inputs):
  # Preprocess to ids and masks, and make the input batch.
  encoded_input = self.tokenizer.batch_encode_plus(
      [ex["sentence"] for ex in inputs],
      return_tensors="pt",
      add_special_tokens=True,
      max_length=128,
      padding="longest",
      truncation="longest_first")
  print(f"encoded input is: {encoded_input}")
  # Check and send to cuda (GPU) if available
  if torch.cuda.is_available():
    print(f"cuda avaialble!")
    self.model.cuda()
    for tensor in encoded_input:
      print(f"tensor is: {tensor}")
      encoded_input[tensor] = encoded_input[tensor].cuda()
    print(f"encoded input after passing to cuda is: {encoded_input}")
  # Run a forward pass.
  with torch.no_grad():  # remove this if you need gradients.
    out: transformers.modeling_outputs.SequenceClassifierOutput = \
        self.model(**encoded_input)
  # Post-process outputs.
  batched_outputs = {
      "probas": torch.nn.functional.softmax(out.logits, dim=-1),
      "input_ids": encoded_input["input_ids"],
      "ntok": torch.sum(encoded_input["attention_mask"], dim=1),
      "cls_emb": out.hidden_states[-1][:, 0],  # last layer, first token
  }
  # Return as NumPy for further processing.
  detached_outputs = {k: v.cpu().numpy() for k, v in batched_outputs.items()}
  # Unbatch outputs so we get one record per input example.
  for output in utils.unbatch_preds(detached_outputs):
    ntok = output.pop("ntok")
    output["tokens"] = self.tokenizer.convert_ids_to_tokens(
        output.pop("input_ids")[1:ntok - 1])
    yield output
I0812 14:55:26.451673 140234135095104 caching.py:210] Prepared 872 inputs for model
encoded input is: {'input_ids': tensor([[ 101, 2009, 1005, 1055, 1037, 11951, 1998, 2411, 12473, 4990, 1012, 102], [ 101, 4895, 10258, 2378, 8450, 2135, 21657, 1998, 7143, 102, 0, 0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
cuda avaialble!
E0812 14:55:26.461915 140234135095104 wsgi_app.py:208] Uncaught error: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Any thoughts would be much appreciated. The GPU environment I have can ordinarily handle these models very easily.
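For context, a standalone check along these lines runs fine on the same machine; the model name below is just a placeholder for whatever checkpoint the demo actually loads:

import torch
import transformers

# Placeholder model: a small HF classifier, just to exercise .cuda() in isolation.
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "google/bert_uncased_L-2_H-128_A-2")
print(f"allocated before .cuda(): {torch.cuda.memory_allocated() / 1e6:.0f} MB")
model.cuda()  # this is the same step that fails inside predict_minibatch() under LIT
print(f"allocated after .cuda():  {torch.cuda.memory_allocated() / 1e6:.0f} MB")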
Thanks in advance!
This is odd, given that it's GPU memory I'm not sure it's from LIT necessarily - in particular, LIT doesn't know about CUDA or the GPU at all, and that's entirely handled through the model code. If you just instantiate the model class and call predict_minibatch() directly from Python, do you get the same error?
See https://github.com/PAIR-code/lit/blob/main/lit_nlp/examples/notebooks/LIT_Components_Example.ipynb for an example notebook that shows how to use LIT components without starting a server. Another thing you could try is running the server with --warm_start=1, which will run inference on start-up in a single thread and can make things easier to debug.
In terms of how the data is handled: predict_minibatch() gets called in a loop here: https://github.com/PAIR-code/lit/blob/main/lit_nlp/api/model.py#L200, the model gets wrapped in CachingModelWrapper: https://github.com/PAIR-code/lit/blob/main/lit_nlp/lib/caching.py#L100, and then predict() gets called in a couple of places in app.py: https://github.com/PAIR-code/lit/blob/main/lit_nlp/app.py. If you test in a notebook, though, you should be able to skip all of that and call your predict_minibatch() directly.
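Roughly, the notebook version of that looks like the sketch below; the class name and checkpoint path are assumptions based on simple_pytorch_demo.py, so adjust them to your setup:

from lit_nlp.examples.datasets import glue
from lit_nlp.examples import simple_pytorch_demo

# Build the dataset and model directly, with no LIT server or caching wrapper involved.
datasets = {"sst_dev": glue.SST2Data("validation")}
models = {"sst_tiny": simple_pytorch_demo.SimpleSentimentModel("/path/to/sst2_tiny")}

# Call predict_minibatch() on a small slice to take batching and caching out of the picture.
preds = list(models["sst_tiny"].predict_minibatch(datasets["sst_dev"].examples[:8]))
print(len(preds), list(preds[0].keys()))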
Hi, thanks for the quick reply.
It does feel odd, and I can confirm that running outside of the server, in plain Python / a Jupyter notebook, still runs into CUDA out of memory.
Using the below in a notebook as an example:
list(models['sst_tiny'].predict_minibatch(datasets['sst_dev'].examples))
Annoyingly, I just tested on a Google Colab instance and it worked fine... although the Colab instance has 15 GB of VRAM vs my ~11 GB. It all seems to happen when model.cuda() is called under the hood of the predict_minibatch() function, and the GPU memory usage inflates to around 11 GB, vs the usual 1069 MiB.
So I can now confirm it is not actually the model.cuda() call that is the issue; it's that the dataset has, I guess, been preloaded onto the CUDA device or something?
If I call model.cuda() BEFORE creating the dataset, the model's GPU usage is normal. So I guess whatever is happening to the dataset is the problem here. But as I mentioned before, batch size changes nothing, and the CUDA memory issues are being caused by the dataset creation/loading via:
datasets = {'sst_dev': glue.SST2Data('validation')}
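For reference, this is roughly how I've been watching memory around the dataset creation (just a sketch; the glue import path is whatever my installed lit-nlp exposes, and torch.cuda.memory_allocated() only sees PyTorch's own allocations, so I keep nvidia-smi open alongside it):

import torch
from lit_nlp.examples.datasets import glue

# Note: memory_allocated() only tracks tensors allocated by PyTorch on the current
# device; allocations made by other libraries only show up in nvidia-smi.
print(f"allocated before dataset creation: {torch.cuda.memory_allocated() / 1e6:.0f} MB")
datasets = {'sst_dev': glue.SST2Data('validation')}
print(f"allocated after dataset creation:  {torch.cuda.memory_allocated() / 1e6:.0f} MB")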
Any further thoughts?
Thanks