Metrics API returns empty response until TS process serves a prediction
Context
When the TorchServe process initially starts, the metrics API endpoints return an empty response. While this is a niche case (most likely TS would have served at least one prediction before the user calls the metrics APIs), it makes the API appear to be broken.
- torchserve version: Installed from source on latest master
- torch version: 1.6.0
- torchvision version: 0.7.0
- java version: openjdk-11
- Operating System and version: Ubuntu 18.04
Your Environment
- Installed using source? [yes/no]: Yes
- Are you planning to deploy it using docker container? [yes/no]: N/A
- Is it a CPU or GPU environment?: CPU
- Using a default/custom handler? [If possible upload/share custom handler/model]: No
- What kind of model is it e.g. vision, text, audio?: N/A
- Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: N/A
- Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: Did not use a config.properties (so just the default config)
- Link to your project [if any]: N/A
Expected Behavior
The /metrics endpoint should return the list of metrics whether or not a prediction has been made.
The /metrics?name endpoint should return 0s or indicate that no requests have been logged.
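For reference, a populated response uses the Prometheus text exposition format and looks roughly like this (the values and HELP text below are illustrative):

# HELP ts_inference_latency_microseconds Cumulative inference duration in microseconds
# TYPE ts_inference_latency_microseconds counter
ts_inference_latency_microseconds{model_name="densenet161",model_version="default",} 1990.348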
Current Behavior
Both endpoints return an empty response.
Possible Solution
Steps to Reproduce
How to reproduce:
- Start a new TorchServe process.
- Run curl http://127.0.0.1:8082/metrics and verify that there is no response.
- Make a call to a prediction endpoint, e.g. curl http://127.0.0.1:8080/predictions/densenet161 -T kitten.jpg
- Run curl http://127.0.0.1:8082/metrics again and verify that the expected response (the list of metrics) is returned.
The same behavior occurs for curl "http://127.0.0.1:8082/metrics?name[]=ts_inference_latency_microseconds&name[]=ts_queue_latency_microseconds" --globoff (--globoff is needed so curl does not interpret the [] brackets in the URL).
Failure Logs [if any]
Note that the metrics API doesn't return results until the first prediction is served. metrics_api_success.txt
This should be taken up as part of converting Prometheus integration into a plugin (#611).
@maheshambule Could you please validate this?
@harshbafna A simple fix that returns a "No data" message in the body should be fine. At present it is confusing for users when they get a 200 OK response with nothing in the metrics on an initial install and conclude that the metrics endpoint is broken. Adding this message is independent of the Prometheus integration.
@harshbafna Looking at the Prometheus Java client code, it seems a metric is initialized during register only if it doesn't have a label.
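The Python prometheus_client shows the same lazy initialization, so here is a minimal sketch of the behavior (metric names below are made up for illustration): a metric declared with label names gets no child, and therefore no sample line, until .labels(...) is first called.

from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()

# No label names: a child is created at registration time,
# so the metric is exposed immediately with value 0.
plain = Counter("demo_requests", "unlabeled demo counter", registry=registry)

# With label names: no child exists until .labels(...) is called,
# so no sample line is exposed for this metric yet.
labeled = Counter("demo_inference", "labeled demo counter",
                  ["model_name"], registry=registry)

print(generate_latest(registry).decode())
# demo_requests_total 0.0 appears; demo_inference_total has
# HELP/TYPE lines but no samples.

labeled.labels("mnist").inc()
print(generate_latest(registry).decode())
# demo_inference_total{model_name="mnist"} 1.0 now appears.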
I currently still have a similar problem: even after model inference, curl http://127.0.0.1:8082/metrics still returns an empty result.
@pengxin233 did you solve this problem? I am experiencing this issue.
print(requests.get("http://127.0.0.1:8082/metrics?").content)
returns b''
even though I have already run inference.
I do see some metrics in the pod's log:
[INFO ] W-9000-mnist_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:98.194|#model_name:mnist,model_version:default
I am using the MNIST sample. The config in that container is:
$ cat /mnt/models/config/config.properties
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
grpc_inference_port=7070
grpc_management_port=7071
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
job_queue_size=10
enable_envvars_config=true
install_py_dep_per_model=true
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"mnist":{"1.0":{"defaultVersion":true,"marName":"mnist.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":10,"responseTimeout":120}}}}
I also see some metrics in /home/model-server/logs/ts_metrics.log
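One quick sanity check (an ad-hoc snippet, not part of TorchServe) is to look at the HTTP status code together with the body, which separates this issue (an empty 200 response) from a misconfigured metrics_address or port:

import requests

# Query the metrics endpoint from the config above (metrics_address, port 8082).
response = requests.get("http://127.0.0.1:8082/metrics", timeout=5)
print(response.status_code, repr(response.content))
# A 200 status with an empty body reproduces the behavior in this issue;
# a ConnectionError instead suggests a wrong metrics_address/port
# in config.properties.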