seldon-core Inconsistency of the notebook model used in GPT2 notebooks

Describe the bug

As described in the Custom pre-processors with the V2 protocol notebook, the model is adapted from the Pretrained GPT2 Model Deployment Example notebook. However, I tried to use the model in the second notebook instead of the Seldon Core model, and it resulted in an error.

To reproduce

Changed schema in the second notebook:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gpt2-separate
spec:
  annotations:
    seldon.io/engine-separate-pod: "true"
  protocol: v2
  predictors:
    - name: default
      graph:
        name: tokeniser-encoder
        children:
          - name: onnx-gpt2
            implementation: TRITON_SERVER
            modelUri: s3://language-models
            envSecretRefName: seldon-init-container-secret
            children:
              - name: tokeniser-decoder
            
      componentSpecs:
        - spec:
            containers:
            - name: onnx-gpt2
        - spec:
            containers:
              - name: tokeniser-encoder
                image: sdghafouri/gpt2-tokeniser:0.1.0
                imagePullPolicy: Always
                env:
                  # Use always a writable HuggingFace cache location regardless of the user
                  - name: TRANSFORMERS_CACHE
                    value: /opt/mlserver/.cache
                  - name: MLSERVER_MODEL_NAME
                    value: "tokeniser-encoder"
        - spec:
            containers:
              - name: tokeniser-decoder
                image: sdghafouri/gpt2-tokeniser:0.1.0
                imagePullPolicy: Always
                env:
                  - name: SELDON_TOKENIZER_TYPE
                    value: "DECODER"
                  # Use always a writable HuggingFace cache location regardless of the user
                  - name: TRANSFORMERS_CACHE
                    value: /opt/mlserver/.cache
                  - name: MLSERVER_MODEL_NAME
                    value: "tokeniser-decoder"

Error in the last part of the notebook:

%%bash
curl localhost:32000/seldon/default/gpt2-separate/v2/models/infer \
    -H 'Content-Type: application/json' \
    -d '{"inputs": [{"name": "sentences", "datatype": "BYTES", "shape": [1, 11], "data": ["Seldon Technologies is very"]}]}'

Output:

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.8/site-packages/mlserver/parallel.py", line 57, in _mp_predict
    return asyncio.run(_mp_model.predict(payload))
  File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/opt/mlserver/./runtime.py", line 37, in predict
    next_token_str = self._tokeniser.decode(
  File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3221, in decode
    return self._decode(
  File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 914, in _decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 889, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/site-packages/starlette_exporter/middleware.py", line 135, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/site-packages/mlserver/rest/app.py", line 27, in custom_route_handler
    return await original_route_handler(request)
  File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 227, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 160, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.8/site-packages/mlserver/rest/endpoints.py", line 59, in infer
    inference_response = await self._data_plane.infer(
  File "/usr/local/lib/python3.8/site-packages/mlserver/handlers/dataplane.py", line 60, in infer
    prediction = await model.predict(payload)
  File "/usr/local/lib/python3.8/site-packages/mlserver/parallel.py", line 117, in _inner
    return await pool.predict(payload)
  File "/usr/local/lib/python3.8/site-packages/mlserver/parallel.py", line 88, in predict
    return await loop.run_in_executor(self._executor, _mp_predict, payload)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
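
The last frame of the worker traceback shows `int(index)` being called on a list, which is what happens when a per-element conversion is handed a nested (for example, batched) list of token ids instead of a flat one. The snippet below is only a minimal pure-Python sketch of that failure mode. It is not the code in the tokeniser image's `runtime.py` (which is not shown in this issue), `flatten_token_ids` and `to_int_ids` are hypothetical helpers, and the token ids are made-up values.

```python
from typing import Any, List


def flatten_token_ids(ids: Any) -> List[int]:
    """Recursively flatten nested lists of token ids into a flat list of ints.

    Hypothetical helper for illustration only.
    """
    if isinstance(ids, (list, tuple)):
        flat: List[int] = []
        for item in ids:
            flat.extend(flatten_token_ids(item))
        return flat
    return [int(ids)]


def to_int_ids(ids: List[Any]) -> List[int]:
    """Mimic the per-element int(index) conversion seen in the traceback."""
    return [int(index) for index in ids]


# One batch row holding two token ids (made-up values).
nested = [[50256, 318]]

try:
    to_int_ids(nested)
    nested_failed = False
except TypeError:
    # int() cannot convert a list: the same error class as in the traceback.
    nested_failed = True

flat_ids = to_int_ids(flatten_token_ids(nested))
print(nested_failed, flat_ids)
```

If the decoder really does receive nested data, the next question is which upstream output produced it, which the exchange below tries to narrow down.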

Expected behaviour

A model generated by the user through the notebook should work the same as the model in the Seldon gs.

Environment

Cloud Provider: Bare Metal
Kubernetes Cluster Version

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-15T14:22:29Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-2+59bbb3530b6769", GitCommit:"59bbb3530b6769e4935a05ac0e13c9910c79253e", GitTreeState:"clean", BuildDate:"2022-05-13T06:41:13Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}

Deployed Seldon System Images:

value: docker.io/seldonio/seldon-core-executor:1.14.0
image: docker.io/seldonio/seldon-core-operator:1.14.0

Aug 03 '22 11:08 saeid93

@adriangonz any ideas?

Aug 15 '22 07:08 ukclivecox

Hey @saeid93 ,

Thanks for opening this ticket.

Would you be able to share the file / folder tree of the s3://language-models bucket? It would also be great if you could share the logs of the Triton container within your inference graph.

Aug 15 '22 16:08 adriangonz

Hey @adriangonz, sorry for the late reply.

This is the folder structure:

mc tree minio/language-models
minio/language-models
└─ onnx-gpt2
   └─ 1

and

mc ls minio/language-models/onnx-gpt2/1
[2022-08-27 17:56:31 UTC] 622MiB STANDARD model.onnx

Here is also the output of the Triton server logs; it looks like the model has been loaded successfully:

k logs gpt2-separate-default-0-onnx-gpt2-d84db944b-ntbwm                                                  
Defaulted container "onnx-gpt2" out of: onnx-gpt2, onnx-gpt2-model-initializer (init)

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 21.08 (build 26170506)

Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
find: '/usr/lib/ssl/private': Permission denied

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use Docker with NVIDIA Container Toolkit to start this container; see
   https://github.com/NVIDIA/nvidia-docker.

I0827 18:05:48.236865 1 libtorch.cc:1029] TRITONBACKEND_Initialize: pytorch
I0827 18:05:48.236968 1 libtorch.cc:1039] Triton TRITONBACKEND API version: 1.4
I0827 18:05:48.236974 1 libtorch.cc:1045] 'pytorch' TRITONBACKEND API version: 1.4
2022-08-27 18:05:48.397308: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0827 18:05:48.434756 1 tensorflow.cc:2169] TRITONBACKEND_Initialize: tensorflow
I0827 18:05:48.434785 1 tensorflow.cc:2179] Triton TRITONBACKEND API version: 1.4
I0827 18:05:48.434790 1 tensorflow.cc:2185] 'tensorflow' TRITONBACKEND API version: 1.4
I0827 18:05:48.434794 1 tensorflow.cc:2209] backend configuration:
{}
I0827 18:05:48.436959 1 onnxruntime.cc:1970] TRITONBACKEND_Initialize: onnxruntime
I0827 18:05:48.436980 1 onnxruntime.cc:1980] Triton TRITONBACKEND API version: 1.4
I0827 18:05:48.436985 1 onnxruntime.cc:1986] 'onnxruntime' TRITONBACKEND API version: 1.4
I0827 18:05:48.457442 1 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0827 18:05:48.457461 1 openvino.cc:1203] Triton TRITONBACKEND API version: 1.4
I0827 18:05:48.457467 1 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.4
W0827 18:05:48.457585 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0827 18:05:48.457607 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0827 18:05:48.458017 1 model_repository_manager.cc:1045] loading: onnx-gpt2:1
I0827 18:05:48.558684 1 onnxruntime.cc:2029] TRITONBACKEND_ModelInitialize: onnx-gpt2 (version 1)
I0827 18:05:50.940684 1 onnxruntime.cc:2072] TRITONBACKEND_ModelInstanceInitialize: onnx-gpt2 (CPU device 0)
I0827 18:05:52.838040 1 model_repository_manager.cc:1212] successfully loaded 'onnx-gpt2' version 1
I0827 18:05:52.838357 1 server.cc:504] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0827 18:05:52.838585 1 server.cc:543] 
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| tensorrt    | <built-in>                                                      | {}     |
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0827 18:05:52.838673 1 server.cc:586] 
+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| onnx-gpt2 | 1       | READY  |
+-----------+---------+--------+

I0827 18:05:52.838896 1 tritonserver.cc:1718] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                  |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                 |
| server_version                   | 2.13.0                                                                                                                                                                                 |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics |
| model_repository_path[0]         | /mnt/models                                                                                                                                                                            |
| model_control_mode               | MODE_NONE                                                                                                                                                                              |
| strict_model_config              | 0                                                                                                                                                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                              |
| min_supported_compute_capability | 6.0                                                                                                                                                                                    |
| strict_readiness                 | 1                                                                                                                                                                                      |
| exit_timeout                     | 30                                                                                                                                                                                     |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0827 18:05:52.842717 1 grpc_server.cc:4111] Started GRPCInferenceService at 0.0.0.0:9500
I0827 18:05:52.843503 1 http_server.cc:2803] Started HTTPService at 0.0.0.0:9000
I0827 18:05:52.886133 1 http_server.cc:162] Started Metrics Service at 0.0.0.0:8002

I still get the same error as mentioned in the original comment. It works fine with the Seldon-hosted gpt2 image, but using the source notebook model (as mentioned in the doc) returns an error.

Aug 27 '22 18:08 saeid93

Hey @saeid93 ,

Thanks for providing that info.

As you say, it does seem like Triton is loading the model just fine. Therefore, it could be that something has changed in the interface of the model, and that's what's causing incompatibilities with the rest of the graph components.

To double-check that:

Could you share the output of querying the model metadata of this model? This would be similar to what is done in this step of the original example: https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example.html#Interact-with-the-model:-get-model-metadata-(a-
Could you send a "standalone" request to the Triton container and share its response? This would be similar to what is done in this step of the original example: https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example.html#Run-prediction-test:-generate-a-sentence-completion-using-GPT2-model---Greedy-approach
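
For reference, the two requests asked for above hit a metadata URL and an inference URL for the same model. The sketch below builds both, assuming the ingress address (`localhost:32000`), namespace, deployment name and model name used elsewhere in this thread. `v2_model_url` and `fetch_model_metadata` are hypothetical helper names, and the fetch is left uncalled because it needs a reachable cluster.

```python
import json
import urllib.request
from typing import Any, Dict


def v2_model_url(host: str, namespace: str, deployment: str, model: str,
                 infer: bool = False) -> str:
    """Build the V2 URL for one model behind a Seldon Core ingress."""
    base = f"http://{host}/seldon/{namespace}/{deployment}/v2/models/{model}"
    return f"{base}/infer" if infer else base


def fetch_model_metadata(url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """GET the model metadata document and parse it as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


metadata_url = v2_model_url("localhost:32000", "default", "gpt2-separate", "onnx-gpt2")
infer_url = v2_model_url("localhost:32000", "default", "gpt2-separate", "onnx-gpt2",
                         infer=True)
print(metadata_url)
print(infer_url)
# With a reachable cluster: print(fetch_model_metadata(metadata_url))
```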

Aug 31 '22 08:08 adriangonz

Hey @adriangonz,

You're welcome. Sure, here is the requested information:

Using curl -s http://localhost:32000/seldon/default/gpt2-separate/v2/models/onnx-gpt2 I get the following:

{"status":{"code":500,"info":"Get \"http://gpt2-separate-default-onnx-gpt2.default.svc.cluster.local.:9001/v2/models/onnx-gpt2\": dial tcp 10.152.183.124:9001: i/o timeout","status":"FAILURE"}}%

I'm not sure what a "standalone" request means; assuming it means accessing the Triton server node directly, I ran the following:

import numpy as np
import requests
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
count = 0
max_gen_len = 10
gen_sentence = input_text
while count < max_gen_len:
    # Count every attempt so the loop always terminates
    count += 1
    input_ids = tokenizer.encode(gen_sentence, return_tensors="tf")
    shape = input_ids.shape.as_list()
    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "datatype": "INT32",
                "shape": shape,
                "data": input_ids.numpy().tolist(),
            },
            {
                "name": "attention_mask",
                "datatype": "INT32",
                "shape": shape,
                "data": np.ones(shape, dtype=np.int32).tolist(),
            },
        ]
    }

    ret = requests.post(
        "http://localhost:32000/seldon/default/gpt2-separate/v2/models/onnx-gpt2/infer",
        json=payload,
    )

    try:
        res = ret.json()
    except ValueError:
        # Response body was not valid JSON; skip this attempt
        continue
    print(res)

It results in the following:

{'error': 'Model onnx-gpt2 not found'}

Sep 02 '22 22:09 saeid93

Hey @saeid93 ,

It seems there may be some nuances with the executor. To avoid getting sidetracked by those, would you be able to deploy the Triton model separately (as in, without the pre-/post-processors) and run the same requests as above?

Sep 05 '22 13:09 adriangonz

Hey @adriangonz,

Using the following:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gpt2-separate
spec:
  annotations:
    seldon.io/engine-separate-pod: "true"
  protocol: v2
  predictors:
    - name: default
      graph:
          name: onnx-gpt2 
          implementation: TRITON_SERVER
          modelUri: s3://language-models
          envSecretRefName: seldon-init-container-secret

      componentSpecs:
        - spec:
            containers:
            - name: onnx-gpt2

With this, the model runs successfully and returns the un-post-processed outputs. Any idea why this is happening? Shouldn't both the Seldon-hosted model and the model I have generated behave the same?

Sep 05 '22 19:09 saeid93

There's clearly some difference between your artefact and the one hosted by Seldon. What we're trying to do now is identify what that difference is. So far, we've managed to establish that the folder structure and model settings look similar. Next step is to validate whether the outputs it sends back are the same as the model hosted by Seldon.

With that in mind, would you be able to share the outputs you got @saeid93 ?
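
One way to make that comparison concrete is to reduce each V2 inference response to a signature (output name, datatype, tensor rank, in response order) and diff the two. This is a rough sketch only: `output_signature` and `signature_diff` are hypothetical helpers, the first sample mirrors the output names reported later in this thread with the data left empty, and the second sample is invented purely to exercise the function. It is not the actual output of the Seldon-hosted model.

```python
from typing import Any, Dict, List, Tuple

Signature = List[Tuple[str, str, int]]


def output_signature(response: Dict[str, Any]) -> Signature:
    """Return (name, datatype, rank) for each output, in response order."""
    return [
        (out["name"], out["datatype"], len(out["shape"]))
        for out in response.get("outputs", [])
    ]


def signature_diff(mine: Dict[str, Any], reference: Dict[str, Any]) -> List[str]:
    """List human-readable differences between two response signatures."""
    sig_a, sig_b = output_signature(mine), output_signature(reference)
    names_a = [name for name, _, _ in sig_a]
    names_b = [name for name, _, _ in sig_b]
    problems: List[str] = []
    if names_a != names_b:
        problems.append(f"output names or order differ: {names_a} vs {names_b}")
    by_name_b = {name: (dtype, rank) for name, dtype, rank in sig_b}
    for name, dtype, rank in sig_a:
        if name in by_name_b and by_name_b[name] != (dtype, rank):
            problems.append(f"{name}: {(dtype, rank)} vs {by_name_b[name]}")
    return problems


# Output names and shapes as reported in this thread; data omitted.
mine = {"outputs": [
    {"name": "past_key_values", "datatype": "FP32", "shape": [12, 2, 1, 12, 6, 64], "data": []},
    {"name": "logits", "datatype": "FP32", "shape": [1, 6, 50257], "data": []},
]}
# INVENTED reference, only here to show what a reported difference looks like.
invented_reference = {"outputs": list(reversed(mine["outputs"]))}

print(signature_diff(mine, mine))
print(signature_diff(mine, invented_reference))
```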

Sep 06 '22 13:09 adriangonz

Hey @adriangonz,

Sure, the debugger console view of the res variable from the above code is as follows (the data part is huge; let me know if that's also needed):

{'model_name': 'onnx-gpt2', 'model_version': '1', 'outputs': [
  {'name': 'past_key_values', 'datatype': 'FP32', 'shape': [12, 2, 1, 12, 6, 64], 'data': [-1.5576592683792114, 2.0584793090820312, 1.3060319423675537, 0.2524546682834625, 0.9934902191162109, 0.35647904872894287, 0.4260801374912262, 0.7812842130661011, -2.27323579788208, 0.23955117166042328, -0.13268378376960754, -0.05451150983572006, 0.7474139928817749, -0.470634788274765, ...]},
  {'name': 'logits', 'datatype': 'FP32', 'shape': [1, 6, 50257], 'data': [-39.23337936401367, -38.93778991699219, -41.76470184326172, -41.707618713378906, -40.768341064453125, -40.82052993774414, -38.552547454833984, -40.08134460449219, -38.026023864746094, ...]}
]}
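
The response above carries two outputs, `past_key_values` first and `logits` second, so any consumer that wants the logits is safer selecting them by name than by position. The sketch below shows that lookup followed by a greedy (argmax) pick of the next token id. It assumes the `data` field is a flat, row-major list matching `shape`. `find_output` and `greedy_next_token_id` are hypothetical helpers, and the toy response uses a made-up 3-token vocabulary rather than real model output.

```python
from typing import Any, Dict


def find_output(response: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Return the output tensor with the given name, regardless of its position."""
    for out in response["outputs"]:
        if out["name"] == name:
            return out
    raise KeyError(f"output {name!r} not found")


def greedy_next_token_id(logits_output: Dict[str, Any]) -> int:
    """Argmax over the vocabulary at the last sequence position of batch row 0."""
    batch, seq_len, vocab = logits_output["shape"]
    data = logits_output["data"]
    if len(data) != batch * seq_len * vocab:
        raise ValueError("data length does not match shape (is it flat?)")
    start = (seq_len - 1) * vocab  # offset of the last position in batch row 0
    last_position = data[start:start + vocab]
    return max(range(vocab), key=last_position.__getitem__)


# Toy response: same output order as above, tiny made-up tensors.
toy_response = {
    "model_name": "onnx-gpt2",
    "outputs": [
        {"name": "past_key_values", "datatype": "FP32", "shape": [1, 2], "data": [0.0, 0.0]},
        {"name": "logits", "datatype": "FP32", "shape": [1, 2, 3],
         "data": [0.1, 0.9, 0.0, 0.2, 0.1, 0.7]},
    ],
}

next_id = greedy_next_token_id(find_output(toy_response, "logits"))
print(next_id)
```

In a real loop the returned id would be decoded with the tokenizer and appended to the sentence before the next request.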

Sep 06 '22 16:09 saeid93

Hey @saeid93 ,

Could you share instead the raw JSON output?

Sep 06 '22 16:09 adriangonz

Hey @adriangonz, sure, there you go: res.txt

Sep 06 '22 16:09 saeid93

Closing this. Please update if still an issue.

Mar 04 '23 10:03 ukclivecox
