Error in DJL 0.14.0 with RoBERTa model
With DJL 0.14.0 and a RoBERTa model, I am getting the following error on predict. Note that the exact same code works with DJL 0.13.0. I have written my own RobertaTokenizer and Translator; however, for the sake of this report, I have hard-coded the inputs so the tokenizer is not needed. The processInput code is at the end.
The error is as follows:
0 [main] DEBUG ai.djl.repository.zoo.DefaultModelZoo - Scanning models in repo: class ai.djl.repository.SimpleRepository, file:/mnt/d/code/djltest/../zirai/aamir/dotakb/reranker/outputs/roberta_squad2_output/traced.pt
29 [main] DEBUG ai.djl.repository.zoo.ModelZoo - Loading model with Criteria:
Application: UNDEFINED
Input: class ai.djl.modality.nlp.qa.QAInput
Output: interface java.util.List
Engine: PyTorch
ModelZoo: ai.djl.localmodelzoo
29 [main] DEBUG ai.djl.repository.zoo.ModelZoo - Searching model in specified model zoo: ai.djl.localmodelzoo
43 [main] DEBUG ai.djl.engine.Engine - Found EngineProvider: PyTorch
43 [main] DEBUG ai.djl.engine.Engine - Found default engine: PyTorch
55 [main] WARN ai.djl.repository.SimpleRepository - Simple repository pointing to a non-archive file.
61 [main] DEBUG ai.djl.repository.zoo.ModelZoo - Checking ModelLoader: ai.djl.localmodelzoo:traced.pt UNDEFINED [
ai.djl.localmodelzoo/traced.pt/traced.pt {}
]
69 [main] DEBUG ai.djl.repository.MRL - Preparing artifact: file:/mnt/d/code/djltest/../zirai/aamir/dotakb/reranker/outputs/roberta_squad2_output/traced.pt, ai.djl.localmodelzoo/traced.pt/traced.pt {}
69 [main] DEBUG ai.djl.repository.SimpleRepository - Skip prepare for local repository.
Loading: 100% |████████████████████████████████████████|
309 [main] DEBUG ai.djl.util.cuda.CudaUtils - cudart library not found.
314 [main] DEBUG ai.djl.pytorch.jni.LibUtils - Using cache dir: /home/aamir/.djl.ai/pytorch/1.9.1-cpu-linux-x86_64
316 [main] INFO ai.djl.pytorch.jni.LibUtils - Extracting /jnilib/linux-x86_64/cpu/libdjl_torch.so to cache ...
444 [main] DEBUG ai.djl.pytorch.jni.LibUtils - Loading pytorch library from: /home/aamir/.djl.ai/pytorch/1.9.1-cpu-linux-x86_64/0.14.0-cpu-libdjl_torch.so
1167 [main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 4
1168 [main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 8
6384 [main] INFO ai.zir.djl.Predictor - Model loaded successfully...
Enter your question and context to get predicted answers via Bert model.
Enter your question or enter exit to finish:
what is the height of mount everest?
Enter your context or enter exit to finish:
There are certain tall and deep things in this world. for example, the depth of mariana trench is 80000 feet and the height of everst is 32000 feet
ai.djl.translate.TranslateException: ai.djl.engine.EngineException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/transformers/models/roberta/modeling_roberta.py", line 13, in forward
qa_outputs = self.qa_outputs
roberta = self.roberta
_0 = (roberta).forward(input_ids, attention_mask, )
~~~~~~~~~~~~~~~~ <--- HERE
_1 = torch.split((qa_outputs).forward(_0, ), 1, -1)
start_logits, end_logits, = _1
File "code/__torch__/transformers/models/roberta/modeling_roberta.py", line 46, in forward
_10 = torch.to(extended_attention_mask, 6)
attention_mask0 = torch.mul(torch.rsub(_10, 1.), CONSTANTS.c0)
_11 = (embeddings).forward(input_ids, input, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
_12 = (encoder).forward(_11, attention_mask0, )
return _12
File "code/__torch__/transformers/models/roberta/modeling_roberta.py", line 73, in forward
incremental_indices = torch.mul(torch.add(_13, CONSTANTS.c1), mask)
input0 = torch.add(torch.to(incremental_indices, 4), CONSTANTS.c2)
_14 = (word_embeddings).forward(input_ids, )
~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_15 = (token_type_embeddings).forward(input, )
embeddings = torch.add(_14, _15)
File "code/__torch__/torch/nn/modules/sparse.py", line 10, in forward
input_ids: Tensor) -> Tensor:
weight = self.weight
inputs_embeds = torch.embedding(weight, input_ids, 1)
~~~~~~~~~~~~~~~ <--- HERE
return inputs_embeds
Traceback of TorchScript, original code (most recent call last):
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/functional.py(2044): embedding
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py(158): forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/aamir/.local/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py(131): forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/aamir/.local/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py(837): forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/aamir/.local/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py(1498): forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/aamir/.local/lib/python3.8/site-packages/torch/jit/_trace.py(958): trace_module
/home/aamir/.local/lib/python3.8/site-packages/torch/jit/_trace.py(741): trace
/mnt/d/code/zirai/aamir/dotakb/reranker/save_torchscript.py(42): save_model_as_torchscript
/mnt/d/code/zirai/aamir/dotakb/reranker/save_torchscript.py(65): <module>
RuntimeError: index out of range in self
at ai.djl.inference.Predictor.batchPredict(Predictor.java:186)
at ai.djl.inference.Predictor.predict(Predictor.java:123)
at ai.zir.djl.Predictor.predictRoberta(Predictor.java:99)
at ai.zir.djl.DjlTest.processInputs(DjlTest.java:63)
at ai.zir.djl.DjlTest.main(DjlTest.java:43)
Caused by: ai.djl.engine.EngineException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/transformers/models/roberta/modeling_roberta.py", line 13, in forward
qa_outputs = self.qa_outputs
roberta = self.roberta
_0 = (roberta).forward(input_ids, attention_mask, )
~~~~~~~~~~~~~~~~ <--- HERE
_1 = torch.split((qa_outputs).forward(_0, ), 1, -1)
start_logits, end_logits, = _1
File "code/__torch__/transformers/models/roberta/modeling_roberta.py", line 46, in forward
_10 = torch.to(extended_attention_mask, 6)
attention_mask0 = torch.mul(torch.rsub(_10, 1.), CONSTANTS.c0)
_11 = (embeddings).forward(input_ids, input, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
_12 = (encoder).forward(_11, attention_mask0, )
return _12
File "code/__torch__/transformers/models/roberta/modeling_roberta.py", line 73, in forward
incremental_indices = torch.mul(torch.add(_13, CONSTANTS.c1), mask)
input0 = torch.add(torch.to(incremental_indices, 4), CONSTANTS.c2)
_14 = (word_embeddings).forward(input_ids, )
~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_15 = (token_type_embeddings).forward(input, )
embeddings = torch.add(_14, _15)
File "code/__torch__/torch/nn/modules/sparse.py", line 10, in forward
input_ids: Tensor) -> Tensor:
weight = self.weight
inputs_embeds = torch.embedding(weight, input_ids, 1)
~~~~~~~~~~~~~~~ <--- HERE
return inputs_embeds
Traceback of TorchScript, original code (most recent call last):
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/functional.py(2044): embedding
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py(158): forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/aamir/.local/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py(131): forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/aamir/.local/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py(837): forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/aamir/.local/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py(1498): forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/aamir/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/aamir/.local/lib/python3.8/site-packages/torch/jit/_trace.py(958): trace_module
/home/aamir/.local/lib/python3.8/site-packages/torch/jit/_trace.py(741): trace
/mnt/d/code/zirai/aamir/dotakb/reranker/save_torchscript.py(42): save_model_as_torchscript
/mnt/d/code/zirai/aamir/dotakb/reranker/save_torchscript.py(65): <module>
RuntimeError: index out of range in self
at ai.djl.pytorch.jni.PyTorchLibrary.moduleForward(Native Method)
at ai.djl.pytorch.jni.IValueUtils.forward(IValueUtils.java:46)
at ai.djl.pytorch.engine.PtSymbolBlock.forwardInternal(PtSymbolBlock.java:126)
at ai.djl.nn.AbstractBlock.forward(AbstractBlock.java:126)
at ai.djl.nn.Block.forward(Block.java:122)
at ai.djl.inference.Predictor.predictInternal(Predictor.java:137)
at ai.djl.inference.Predictor.batchPredict(Predictor.java:177)
... 4 more
To keep things simple, I have hard-coded the input in processInput and padded/masked it to length 128. Here is what processInput in the translator looks like:
@Override
public NDList processInput(TranslatorContext translatorContext, QAInput qaInput) throws Exception {
NDManager manager = translatorContext.getNDManager();
// Hard-coded RoBERTa token ids for the question/context pair.
long[] indices = new long[] {0, 12196, 16, 5, 6958, 9, 14206, 15330, 7110, 116, 2, 2, 37099, 9, 15330, 7110, 16, 2107, 151, 1730, 2};
int INPUT_LENGTH = 128;
long[] finalIndices = new long[INPUT_LENGTH];
long[] attentionMasks = new long[INPUT_LENGTH];
// Copy the token ids and pad the remainder up to INPUT_LENGTH with RoBERTa's <pad> id (1).
System.arraycopy(indices, 0, finalIndices, 0, indices.length);
Arrays.fill(finalIndices, indices.length, INPUT_LENGTH, 1);
// Attention mask: 1 for real tokens, 0 for padding.
Arrays.fill(attentionMasks, 0, indices.length, 1);
Arrays.fill(attentionMasks, indices.length, INPUT_LENGTH, 0);
NDArray indicesArray = manager.create(finalIndices);
NDArray attentionMaskArray = manager.create(attentionMasks);
// The order matters
return new NDList(indicesArray, attentionMaskArray);
}
The predict code is simply as follows:
var predictor = model.newPredictor(translator);
predictor.predict(new QAInput(question, paragraph));
The error appears at this line: predictor.predict(new QAInput(question, paragraph));
@frankfliu Can you please take a look at this? I was originally seeing this on an Inferentia instance, where DJL 0.14.0 was required, but I have since found that the error lies with DJL 0.14.0 itself and has nothing to do with Inferentia. The same code works fine with 0.13.0.
@aamirbutt The difference between 0.13.0 and 0.14.0 is the PyTorch version; can you try using PyTorch 1.9.0?
export PYTORCH_VERSION=1.9.0
You can also try running it with Python to see whether PyTorch 1.9.1 has the same issue.
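A minimal Python-side check along these lines might look like the sketch below. It mirrors the hard-coded token ids and the pad-to-128 logic from the Java processInput above; the "traced.pt" path and the helper names are illustrative, not from the thread.

```python
INPUT_LENGTH = 128
PAD_ID = 1  # RoBERTa's <pad> token id

def pad_and_mask(indices, length=INPUT_LENGTH, pad_id=PAD_ID):
    """Replicate the Java processInput: pad ids with <pad>=1 and build the attention mask."""
    ids = list(indices) + [pad_id] * (length - len(indices))
    mask = [1] * len(indices) + [0] * (length - len(indices))
    return ids, mask

def run_traced_model(path="traced.pt"):
    # Lazy import so the padding helper above stays stdlib-only;
    # use torch 1.9.1 to match the native library DJL 0.14.0 bundles.
    import torch

    # Same hard-coded RoBERTa indices as the Java translator.
    indices = [0, 12196, 16, 5, 6958, 9, 14206, 15330, 7110, 116, 2, 2,
               37099, 9, 15330, 7110, 16, 2107, 151, 1730, 2]
    ids, mask = pad_and_mask(indices)

    model = torch.jit.load(path)
    model.eval()
    with torch.no_grad():
        # Same argument order as the NDList in the Java translator: ids, then mask.
        return model(torch.tensor([ids], dtype=torch.long),
                     torch.tensor([mask], dtype=torch.long))
```

If calling run_traced_model() under PyTorch 1.9.1 in Python raises the same "index out of range in self" error, the problem is in the traced model or the PyTorch version rather than in DJL.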
The problem looks to be in DJL 0.14.0 itself. Here is a working run from DJL 0.13.0, which apparently also loads PyTorch 1.9.1:
0 [main] DEBUG ai.djl.repository.zoo.DefaultModelZoo - Scanning models in repo: class ai.djl.repository.SimpleRepository, file:/D:/code/zirai/aamir/dotakb/reranker/outputs/roberta_squad2_output/traced.pt
4 [main] DEBUG ai.djl.repository.zoo.ModelZoo - Loading model with Criteria:
Application: UNDEFINED
Input: class ai.djl.modality.nlp.qa.QAInput
Output: interface java.util.List
Engine: PyTorch
ModelZoo: ai.djl.localmodelzoo
4 [main] DEBUG ai.djl.repository.zoo.ModelZoo - Searching model in specified model zoo: ai.djl.localmodelzoo
10 [main] DEBUG ai.djl.engine.Engine - Found EngineProvider: PyTorch
13 [main] DEBUG ai.djl.engine.Engine - Found default engine: PyTorch
16 [main] WARN ai.djl.repository.SimpleRepository - Simple repository pointing to a non-archive file.
18 [main] DEBUG ai.djl.repository.zoo.ModelZoo - Checking ModelLoader: ai.djl.localmodelzoo:traced.pt UNDEFINED [
ai.djl.localmodelzoo/traced.pt/traced.pt {}
]
21 [main] DEBUG ai.djl.repository.MRL - Preparing artifact: file:/D:/code/zirai/aamir/dotakb/reranker/outputs/roberta_squad2_output/traced.pt, ai.djl.localmodelzoo/traced.pt/traced.pt {}
21 [main] DEBUG ai.djl.repository.SimpleRepository - Skip prepare for local repository.
Loading: 100% |========================================|
112 [main] DEBUG ai.djl.util.cuda.CudaUtils - No cudart library found in path.
123 [main] DEBUG ai.djl.pytorch.jni.LibUtils - Using cache dir: C:\Users\Aamir\.djl.ai\pytorch
209 [main] DEBUG ai.djl.pytorch.jni.LibUtils - Loading pytorch library from: C:\Users\Aamir\.djl.ai\pytorch\1.9.1-cpu-win-x86_64\0.13.0-cpu-djl_torch.dll
1023 [main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 4
1318 [main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 8
9886 [main] INFO ai.zir.djl.Predictor - Model loaded successfully...
Enter your question and context to get predicted answers via Bert model.
Enter your question or enter exit to finish:
what is the height of mount everest?
Enter your context or enter exit to finish:
There are certain tall and deep things in this world. for example, the depth of mariana trench is 80000 feet and the height of everst is 32000 feet
Answer: 32000 feet
Press Enter to continue...
I tried your suggestion of loading PyTorch 1.9.0 and the problem persists.
This is really strange. Would you mind building DJL from source and using git bisect to find which commit causes the issue?
You need to manually build the PyTorch JNI and then test your code:
call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" amd64
gradlew :engines:pytorch:pytorch-native:compileJNI -Ppt_version=1.9.1
I don't have VS2019 license, unfortunately.
Visual Studio Community edition is free.