seldon-core
Inconsistency of the notebook model used in GPT2 notebooks
Describe the bug
As described in the Custom pre-processors with the V2 protocol notebook, the model is adapted from the Pretrained GPT2 Model Deployment Example notebook. However, when I tried to use the model from the second notebook instead of the Seldon Core model, it resulted in an error.
To reproduce
Changed schema in the second notebook:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gpt2-separate
spec:
  annotations:
    seldon.io/engine-separate-pod: "true"
  protocol: v2
  predictors:
    - name: default
      graph:
        name: tokeniser-encoder
        children:
          - name: onnx-gpt2
            implementation: TRITON_SERVER
            modelUri: s3://language-models
            envSecretRefName: seldon-init-container-secret
            children:
              - name: tokeniser-decoder
      componentSpecs:
        - spec:
            containers:
              - name: onnx-gpt2
        - spec:
            containers:
              - name: tokeniser-encoder
                image: sdghafouri/gpt2-tokeniser:0.1.0
                imagePullPolicy: Always
                env:
                  # Use always a writable HuggingFace cache location regardless of the user
                  - name: TRANSFORMERS_CACHE
                    value: /opt/mlserver/.cache
                  - name: MLSERVER_MODEL_NAME
                    value: "tokeniser-encoder"
        - spec:
            containers:
              - name: tokeniser-decoder
                image: sdghafouri/gpt2-tokeniser:0.1.0
                imagePullPolicy: Always
                env:
                  - name: SELDON_TOKENIZER_TYPE
                    value: "DECODER"
                  # Use always a writable HuggingFace cache location regardless of the user
                  - name: TRANSFORMERS_CACHE
                    value: /opt/mlserver/.cache
                  - name: MLSERVER_MODEL_NAME
                    value: "tokeniser-decoder"
Error in the last part of the notebook:
%%bash
curl localhost:32000/seldon/default/gpt2-separate/v2/models/infer \
-H 'Content-Type: application/json' \
-d '{"inputs": [{"name": "sentences", "datatype": "BYTES", "shape": [1, 11], "data": ["Seldon Technologies is very"]}]}'
output:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.8/site-packages/mlserver/parallel.py", line 57, in _mp_predict
return asyncio.run(_mp_model.predict(payload))
File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/opt/mlserver/./runtime.py", line 37, in predict
next_token_str = self._tokeniser.decode(
File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3221, in decode
return self._decode(
File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 914, in _decode
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 889, in convert_ids_to_tokens
index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.8/site-packages/starlette_exporter/middleware.py", line 135, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
raise exc
File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/usr/local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
response = await func(request)
File "/usr/local/lib/python3.8/site-packages/mlserver/rest/app.py", line 27, in custom_route_handler
return await original_route_handler(request)
File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 227, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 160, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.8/site-packages/mlserver/rest/endpoints.py", line 59, in infer
inference_response = await self._data_plane.infer(
File "/usr/local/lib/python3.8/site-packages/mlserver/handlers/dataplane.py", line 60, in infer
prediction = await model.predict(payload)
File "/usr/local/lib/python3.8/site-packages/mlserver/parallel.py", line 117, in _inner
return await pool.predict(payload)
File "/usr/local/lib/python3.8/site-packages/mlserver/parallel.py", line 88, in predict
return await loop.run_in_executor(self._executor, _mp_predict, payload)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
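The innermost frame of the worker traceback calls int(index) on each element passed to the tokeniser's decode step, and fails because one element is itself a list. Below is a minimal sketch of that failure mode. It assumes, without confirmation from this thread, that the decoder received a nested list of token ids instead of a flat one. The helpers ids_to_ints and flatten are hypothetical and are not part of transformers or MLServer, and the ids are illustrative only.

```python
def ids_to_ints(token_ids):
    """Hypothetical stand-in for the int(index) loop seen in the traceback."""
    return [int(index) for index in token_ids]


def flatten(values):
    """Flatten arbitrarily nested lists of ids into one flat list."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten(value))
        else:
            flat.append(value)
    return flat


flat_ids = [40, 2883, 1762]      # illustrative ids only
nested_ids = [[40, 2883, 1762]]  # same ids with an extra leading dimension

# A flat list converts cleanly.
print(ids_to_ints(flat_ids))

# A nested list raises TypeError, because int() receives a list.
try:
    ids_to_ints(nested_ids)
except TypeError as exc:
    print("TypeError:", exc)

# Flattening first avoids the error.
print(ids_to_ints(flatten(nested_ids)))
```

If this reading is right, the shape of the tensor reaching the decoder would differ between the two model artefacts, which is something the model metadata and raw outputs can confirm or rule out.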
Expected behaviour
The user-generated model built through the notebook should work the same as the model in the Seldon gs.
Environment
- Cloud Provider: Bare Metal
- Kubernetes Cluster Version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-15T14:22:29Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-2+59bbb3530b6769", GitCommit:"59bbb3530b6769e4935a05ac0e13c9910c79253e", GitTreeState:"clean", BuildDate:"2022-05-13T06:41:13Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}
- Deployed Seldon System Images:
value: docker.io/seldonio/seldon-core-executor:1.14.0
image: docker.io/seldonio/seldon-core-operator:1.14.0
@adriangonz any ideas?
Hey @saeid93 ,
Thanks for opening this ticket.
Would you be able to share the file / folder tree of the s3://language-models bucket? It would also be great if you could share the logs of the Triton container within your inference graph.
Hey @adriangonz, sorry for the late reply.
This is the folder structure:
mc tree minio/language-models
minio/language-models
└─ onnx-gpt2
└─ 1
and
mc ls minio/language-models/onnx-gpt2/1
[2022-08-27 17:56:31 UTC] 622MiB STANDARD model.onnx
Here is also the output of the Triton server logs; it looks like the model has been loaded successfully:
k logs gpt2-separate-default-0-onnx-gpt2-d84db944b-ntbwm
Defaulted container "onnx-gpt2" out of: onnx-gpt2, onnx-gpt2-model-initializer (init)
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 21.08 (build 26170506)
Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
find: '/usr/lib/ssl/private': Permission denied
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use Docker with NVIDIA Container Toolkit to start this container; see
https://github.com/NVIDIA/nvidia-docker.
I0827 18:05:48.236865 1 libtorch.cc:1029] TRITONBACKEND_Initialize: pytorch
I0827 18:05:48.236968 1 libtorch.cc:1039] Triton TRITONBACKEND API version: 1.4
I0827 18:05:48.236974 1 libtorch.cc:1045] 'pytorch' TRITONBACKEND API version: 1.4
2022-08-27 18:05:48.397308: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0827 18:05:48.434756 1 tensorflow.cc:2169] TRITONBACKEND_Initialize: tensorflow
I0827 18:05:48.434785 1 tensorflow.cc:2179] Triton TRITONBACKEND API version: 1.4
I0827 18:05:48.434790 1 tensorflow.cc:2185] 'tensorflow' TRITONBACKEND API version: 1.4
I0827 18:05:48.434794 1 tensorflow.cc:2209] backend configuration:
{}
I0827 18:05:48.436959 1 onnxruntime.cc:1970] TRITONBACKEND_Initialize: onnxruntime
I0827 18:05:48.436980 1 onnxruntime.cc:1980] Triton TRITONBACKEND API version: 1.4
I0827 18:05:48.436985 1 onnxruntime.cc:1986] 'onnxruntime' TRITONBACKEND API version: 1.4
I0827 18:05:48.457442 1 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0827 18:05:48.457461 1 openvino.cc:1203] Triton TRITONBACKEND API version: 1.4
I0827 18:05:48.457467 1 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.4
W0827 18:05:48.457585 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0827 18:05:48.457607 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0827 18:05:48.458017 1 model_repository_manager.cc:1045] loading: onnx-gpt2:1
I0827 18:05:48.558684 1 onnxruntime.cc:2029] TRITONBACKEND_ModelInitialize: onnx-gpt2 (version 1)
I0827 18:05:50.940684 1 onnxruntime.cc:2072] TRITONBACKEND_ModelInstanceInitialize: onnx-gpt2 (CPU device 0)
I0827 18:05:52.838040 1 model_repository_manager.cc:1212] successfully loaded 'onnx-gpt2' version 1
I0827 18:05:52.838357 1 server.cc:504]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0827 18:05:52.838585 1 server.cc:543]
+-------------+-----------------------------------------------------------------+--------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+--------+
| tensorrt | <built-in> | {} |
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
| openvino | /opt/tritonserver/backends/openvino/libtriton_openvino.so | {} |
+-------------+-----------------------------------------------------------------+--------+
I0827 18:05:52.838673 1 server.cc:586]
+-----------+---------+--------+
| Model | Version | Status |
+-----------+---------+--------+
| onnx-gpt2 | 1 | READY |
+-----------+---------+--------+
I0827 18:05:52.838896 1 tritonserver.cc:1718]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.13.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics |
| model_repository_path[0] | /mnt/models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| pinned_memory_pool_byte_size | 268435456 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0827 18:05:52.842717 1 grpc_server.cc:4111] Started GRPCInferenceService at 0.0.0.0:9500
I0827 18:05:52.843503 1 http_server.cc:2803] Started HTTPService at 0.0.0.0:9000
I0827 18:05:52.886133 1 http_server.cc:162] Started Metrics Service at 0.0.0.0:8002
I still get the same error as in the original comment. It works fine with the Seldon-hosted gpt2 image, but using the model from the source notebook (as mentioned in the doc) returns this error.
Hey @saeid93 ,
Thanks for providing that info.
As you say, it does seem like Triton is loading the model just fine. Therefore, it could be that something has changed in the interface of the model, and that's what's causing incompatibilities with the rest of the graph components.
To double check that,
- Could you share the output of querying the metadata of this model? This would be similar to what is done in this step of the original example: https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example.html#Interact-with-the-model:-get-model-metadata-(a-
- Could you send a "standalone" request to the Triton container and share its response? This would be similar to what is done in this step of the original example: https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example.html#Run-prediction-test:-generate-a-sentence-completion-using-GPT2-model---Greedy-approach
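As a rough sketch of the first request, the V2 inference protocol exposes model metadata via a GET on v2/models/<model name>. The snippet below builds that URL behind the Seldon ingress path used earlier in this issue and fetches it with the standard library. The host and port, namespace, deployment name, and model name are taken from this thread and are assumptions about any other cluster; model_metadata_url and fetch_model_metadata are hypothetical helper names.

```python
import json
import urllib.request


def model_metadata_url(base_url, namespace, deployment, model_name):
    """Build the V2 model-metadata URL behind the Seldon ingress path."""
    return (
        f"{base_url.rstrip('/')}/seldon/{namespace}/{deployment}"
        f"/v2/models/{model_name}"
    )


def fetch_model_metadata(url, timeout=10.0):
    """GET the metadata document and decode it as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = model_metadata_url(
    "http://localhost:32000", "default", "gpt2-separate", "onnx-gpt2"
)
print(url)

# Requires a reachable cluster, so it is left commented out here:
# metadata = fetch_model_metadata(url)
# print(json.dumps(metadata, indent=2))
```

The metadata response lists the model's input and output tensors with their names, datatypes, and shapes, which is the information needed to compare two model artefacts.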
Hey @adriangonz,
You're welcome. Sure, here is the requested information:
- Using
curl -s http://localhost:32000/seldon/default/gpt2-separate/v2/models/onnx-gpt2
I get the following:
{"status":{"code":500,"info":"Get \"http://gpt2-separate-default-onnx-gpt2.default.svc.cluster.local.:9001/v2/models/onnx-gpt2\": dial tcp 10.152.183.124:9001: i/o timeout","status":"FAILURE"}}%
- I'm not sure what a standalone request means here. Assuming it means accessing the Triton server node directly, I ran the following:
import json

import numpy as np
import requests
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
count = 0
max_gen_len = 10
gen_sentence = input_text
while count < max_gen_len:
    # Encode the current sentence into token ids (TensorFlow tensor)
    input_ids = tokenizer.encode(gen_sentence, return_tensors="tf")
    shape = input_ids.shape.as_list()
    # V2 inference request with the two inputs expected by the ONNX GPT2 model
    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "datatype": "INT32",
                "shape": shape,
                "data": input_ids.numpy().tolist(),
            },
            {
                "name": "attention_mask",
                "datatype": "INT32",
                "shape": shape,
                "data": np.ones(shape, dtype=np.int32).tolist(),
            },
        ]
    }
    ret = requests.post(
        "http://localhost:32000/seldon/default/gpt2-separate/v2/models/onnx-gpt2/infer",
        json=payload,
    )
    # Advance the counter so the loop stops after max_gen_len requests
    count += 1
    try:
        res = ret.json()
    except ValueError:
        # Skip responses whose body is not valid JSON
        continue
    print(res)
It results in the following:
{'error': 'Model onnx-gpt2 not found'}
Hey @saeid93 ,
It seems there may be some nuances with the executor. To avoid getting sidetracked by those, would you be able to deploy the Triton model separately (as in, without the pre-/post-processors) and run the same requests as above?
Hey @adriangonz,
Using the following:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gpt2-separate
spec:
  annotations:
    seldon.io/engine-separate-pod: "true"
  protocol: v2
  predictors:
    - name: default
      graph:
        name: onnx-gpt2
        implementation: TRITON_SERVER
        modelUri: s3://language-models
        envSecretRefName: seldon-init-container-secret
      componentSpecs:
        - spec:
            containers:
              - name: onnx-gpt2
With this, the model runs successfully and returns the un-post-processed outputs. Any idea why this is happening? Shouldn't both the Seldon-hosted model and the model I generated behave the same?
There's clearly some difference between your artefact and the one hosted by Seldon. What we're trying to do now is identify what that difference is. So far, we've managed to establish that the folder structure and model settings look similar. Next step is to validate whether the outputs it sends back are the same as the model hosted by Seldon.
With that in mind, would you be able to share the outputs you got @saeid93 ?
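One way to make that comparison concrete is to reduce each V2 inference response to its output signature (name, datatype, shape, and order) and diff the two. The sketch below assumes both responses follow the V2 JSON layout with an "outputs" list. The helper names are hypothetical, and the two sample responses use placeholder tensor names rather than real model outputs.

```python
def output_signature(response):
    """Map each output name to (datatype, shape) for a V2 inference response."""
    return {
        output["name"]: (output["datatype"], tuple(output["shape"]))
        for output in response.get("outputs", [])
    }


def output_order(response):
    """Return the output names in the order the server sent them."""
    return [output["name"] for output in response.get("outputs", [])]


def signature_diff(reference, candidate):
    """List outputs whose presence, datatype, or shape differ."""
    ref_sig = output_signature(reference)
    cand_sig = output_signature(candidate)
    return {
        name: (ref_sig.get(name), cand_sig.get(name))
        for name in sorted(set(ref_sig) | set(cand_sig))
        if ref_sig.get(name) != cand_sig.get(name)
    }


# Placeholder responses; tensor names and shapes are made up for illustration.
reference = {"outputs": [
    {"name": "a", "datatype": "FP32", "shape": [1, 2], "data": [0.0, 1.0]},
    {"name": "b", "datatype": "FP32", "shape": [1, 3], "data": [0.0, 1.0, 2.0]},
]}
candidate = {"outputs": [
    {"name": "b", "datatype": "FP32", "shape": [1, 3], "data": [0.0, 1.0, 2.0]},
    {"name": "a", "datatype": "FP32", "shape": [2, 1], "data": [0.0, 1.0]},
]}

print(signature_diff(reference, candidate))
print(output_order(reference), output_order(candidate))
```

Checking the order separately matters if any downstream component selects an output by position rather than by name.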
Hey @adriangonz,
Sure, the debugger console view of the res variable from the above code is as follows (the data part is huge; let me know if that's also needed):
{'model_name': 'onnx-gpt2', 'model_version': '1', 'outputs': [{...}, {...}]}
  'model_name': 'onnx-gpt2'
  'model_version': '1'
  'outputs': [{'name': 'past_key_values', 'datatype': 'FP32', 'shape': [...], 'data': [...]}, {'name': 'logits', 'datatype': 'FP32', 'shape': [...], 'data': [...]}]
    0:
      'name': 'past_key_values'
      'datatype': 'FP32'
      'shape': [12, 2, 1, 12, 6, 64]
      'data': [-1.5576592683792114, 2.0584793090820312, 1.3060319423675537, 0.2524546682834625, 0.9934902191162109, 0.35647904872894287, 0.4260801374912262, 0.7812842130661011, -2.27323579788208, 0.23955117166042328, -0.13268378376960754, -0.05451150983572006, 0.7474139928817749, -0.470634788274765, ...]
      len(): 4
    1: {'name': 'logits', 'datatype': 'FP32', 'shape': [1, 6, 50257], 'data': [-39.23337936401367, -38.93778991699219, -41.76470184326172, -41.707618713378906, -40.768341064453125, -40.82052993774414, -38.552547454833984, -40.08134460449219, -38.026023864746094, ...]}
    len(): 2
  len(): 3
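The response above carries two outputs, past_key_values and logits, with logits shaped [1, 6, 50257]. As a sketch of how a consumer could pick the next token without depending on output order, the snippet below looks up the logits tensor by name and takes the argmax over the last sequence position. It assumes the "data" field is a flat, row-major list, as it appears in the debugger view, and that the shape is [batch, sequence, vocabulary]. The function names are hypothetical, and the tiny response at the end is synthetic.

```python
def find_output(response, name):
    """Return the output entry with the given name from a V2 response."""
    for output in response["outputs"]:
        if output["name"] == name:
            return output
    raise KeyError(f"output {name!r} not found")


def greedy_next_token_id(response, name="logits"):
    """Argmax over the vocabulary at the last position of the first batch item."""
    logits = find_output(response, name)
    _batch, seq_len, vocab_size = logits["shape"]
    data = logits["data"]
    # Offset of the last sequence position within the first batch item
    start = (seq_len - 1) * vocab_size
    last_position = data[start:start + vocab_size]
    return max(range(vocab_size), key=last_position.__getitem__)


# Synthetic response: 1 batch item, 2 positions, vocabulary of 3.
tiny = {
    "outputs": [
        {"name": "past_key_values", "datatype": "FP32", "shape": [1], "data": [0.0]},
        {
            "name": "logits",
            "datatype": "FP32",
            "shape": [1, 2, 3],
            "data": [0.1, 0.9, 0.0, 0.2, 0.1, 0.7],
        },
    ]
}
print(greedy_next_token_id(tiny))
```

Selecting by name keeps the lookup stable if two exports of the same model emit their outputs in a different order.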
Hey @saeid93 ,
Could you share instead the raw JSON output?
Hey @adriangonz, sure, there you go: res.txt
Closing this. Please update if still an issue.