HTTP 500 Internal server error when cancelling a job, new jobs stuck in waiting on Docker
> version
1.6.0
> list-models
bah
mymodel
Job is initially waiting:
> list-jobs
id name state x/N time_left duration
2148faaf-b7b8-436a-82d0-d9b91d74c9ea training(bah) waiting 496.464 <-----
4dd2fb2c-b629-4aad-93a5-ce4df0626121 training(mymodel) canceling 1687.226
6eacbedc-c289-4f06-9360-a05e199ac5eb training(mymodel) canceling 1783.785
8f411abc-7a96-440b-b550-505c83f021a5 training(bah) waiting 684.152
da80bb65-f1e1-4d42-8c89-1feb49190bf0 training(mymodel) canceling 1817.056
Then I try to cancel it:
> cancel-job 2148faaf-b7b8-436a-82d0-d9b91d74c9ea
ERROR:root:TransportError(500, 'internal server error')
When I check, it is actually trying to cancel the job, but it stays stuck in the canceling state forever:
> list-jobs
id name state x/N time_left duration
2148faaf-b7b8-436a-82d0-d9b91d74c9ea training(bah) canceling 1134.971 <----
4dd2fb2c-b629-4aad-93a5-ce4df0626121 training(mymodel) canceling 2325.733
6eacbedc-c289-4f06-9360-a05e199ac5eb training(mymodel) canceling 2422.291
8f411abc-7a96-440b-b550-505c83f021a5 training(bah) waiting 1322.659
da80bb65-f1e1-4d42-8c89-1feb49190bf0 training(mymodel) canceling 2455.562
For reference:
> list-models
bah
mymodel
> show-model mymodel
- settings:
bucket_interval: 10s
default_bucket: influxdb
features:
- anomaly_type: low_high
default: null
field: average_response_ms
io: io
match_all:
- tag: url
value: host1.example.com
measurement: ping
metric: mean
name: myfeature
grace_period: 0
interval: 60s
max_evals: 20
max_threshold: 0
min_threshold: 0
name: mymodel
offset: 10s
seasonality:
daytime: false
weekday: false
span: 10
type: donut
training:
job_id: eebdaf5e-bdad-4936-a8aa-f8595b078120
state: failed
>
I suspect it may have something to do with the database. How do I check whether loudml is actually connected and writing to the DB?
> list-buckets
influxdb
loudml
> show-bucket loudml
- addr: myinfluxdb:8086
database: loudml
dbuser: admin
name: loudml
type: influxdb
>
Externally, I've already confirmed that I can reach myinfluxdb and write data to it.
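For reference, these are the kinds of external checks I ran (the password is a placeholder, and this assumes plain HTTP against the InfluxDB 1.x API):
# /ping should return HTTP 204 if InfluxDB is up
curl -i http://myinfluxdb:8086/ping
# write a throwaway point to the loudml database in line protocol
curl -i -XPOST 'http://myinfluxdb:8086/write?db=loudml' -u admin:<password> --data-binary 'loudml_debug,source=manual value=1'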
I also see this weird Python exception in the logs when I try to run the training:
10.103.130.201 - - [2020-08-19 19:02:11] "GET /models HTTP/1.1" 200 1205 0.003390
[2020-08-19 19:02:12,667] ERROR in app: Exception on /models/bah/_train [POST]
Traceback (most recent call last):
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/venv/lib/python3.7/site-packages/flask_restful/__init__.py", line 269, in error_router
return original_handler(e)
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/opt/venv/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/opt/venv/lib/python3.7/site-packages/loudml/server.py", line 932, in model_train
job.start(g_config)
File "/opt/venv/lib/python3.7/site-packages/loudml/server.py", line 1430, in start
kwargs=self.kwargs,
File "/opt/venv/lib/python3.7/site-packages/pebble/pool/process.py", line 84, in schedule
self._check_pool_state()
File "/opt/venv/lib/python3.7/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state
raise RuntimeError('Unexpected error within the Pool')
RuntimeError: Unexpected error within the Pool
ERROR:loudml.server:Exception on /models/bah/_train [POST]
Traceback (most recent call last):
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/venv/lib/python3.7/site-packages/flask_restful/__init__.py", line 269, in error_router
return original_handler(e)
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/opt/venv/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/opt/venv/lib/python3.7/site-packages/loudml/server.py", line 932, in model_train
job.start(g_config)
File "/opt/venv/lib/python3.7/site-packages/loudml/server.py", line 1430, in start
kwargs=self.kwargs,
File "/opt/venv/lib/python3.7/site-packages/pebble/pool/process.py", line 84, in schedule
self._check_pool_state()
File "/opt/venv/lib/python3.7/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state
raise RuntimeError('Unexpected error within the Pool')
RuntimeError: Unexpected error within the Pool
10.103.130.201 - - [2020-08-19 19:02:12] "POST /models/bah/_train?from=2020-08-19T18:59:00.000Z&to=2020-08-19T18:59:00.000Z HTTP/1.1" 500 156 0.005088
10.103.130.201 - - [2020-08-19 19:02:12] "GET /jobs/4c17ee83-3ebf-419f-8e56-3aa6808b6f42 HTTP/1.1" 404 134 0.000463
Any pointers would be appreciated. I also tried the nightly and 1.5.0 images; they all show the same error.
I'm running into pretty much the exact same issue with my Kubernetes-deployed version, but with an Elasticsearch bucket.
I initially thought it was a DB problem as well, but I no longer think so. I realized there was an error in my bucket configuration, but correcting it didn't change the error at all. I also don't see why a failure to cancel a job should have anything to do with the DB.
Have you had any luck figuring it out?
@regel any suggestions for next troubleshooting steps?
To add to this (not sure if it's related), when I shell into the pod in Kubernetes, no user is assigned and the prompt shows I have no name!@loudml-0:/. More than likely I just need to refactor the Helm chart for 1.6.0, but I figured I'd add it in case it helps.
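For what it's worth, the I have no name! prompt usually just means the container's runtime UID has no matching entry in /etc/passwd; a quick way to confirm from outside the pod (pod name taken from my prompt above) is:
kubectl exec -it loudml-0 -- id
kubectl exec -it loudml-0 -- cat /etc/passwd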
As I mentioned in my previous comment, changing the database info doesn't seem to have had any effect. Is there any way I can test this to rule it out? I validated that the credentials are correct, but I'm not sure how to confirm that loudml is actually able to access the database.
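The only check I could think of in the meantime was hitting Elasticsearch from inside the loudml pod with the same credentials the bucket uses (host, user, and password below are placeholders, and this assumes curl is available in the image):
kubectl exec -it loudml-0 -- curl -sk -u <dbuser>:<password> 'https://<es-host>:9200/_cluster/health?pretty'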
Adding my reproduction to this ticket; I'm getting the same result as the OP:
Using the 1.6.0 Docker image, I am unable to start or cancel jobs without errors.
Below are the creation steps for each resource and their output, followed by the stack trace from the logs. The job stays in waiting until I try to cancel it; when I do, the cancel request errors out and the job is stuck in canceling until I restart.
Model Creation:
curl -XPOST localhost:8077/models --data-binary "@/home/me/Documents/deny-model.json" -H "Content-Type: application/json"
{
"bucket_interval": "1m",
"default_datasource": "firewall-count-index",
"features": [
{
"default": 0,
"field": "Deny_count",
"metric": "avg",
"name": "avg_denies"
}
],
"grace_period": 0,
"interval": 60,
"max_evals": 10,
"max_threshold": 90,
"min_threshold": 50,
"name": "avg-denied-counts-redo",
"offset": 30,
"span": 5,
"timestamp_field": "@timestamp",
"type": "donut"
}
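After posting the model definition, I verified it showed up before moving on (this is the same GET /models endpoint that appears in the OP's log output above):
curl -s localhost:8077/models | jq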
Bucket Creation:
curl -XPOST localhost:8077/buckets --data-binary "@/home/me/Documents/firewallbucket.json" -H "Content-Type: application/json"
{
"addr": "https://k8s-cluster-es.namespace.cluster.local:9200",
"dbuser": "svc_loudml",
"doc_type": "_doc",
"index": "datacount.all-*",
"name": "firewall-count-index",
"type": "elasticsearch",
"use_ssl": true,
"verify_ssl": false
"dbuser_password": ""
}
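I also double-checked that the bucket was registered (assuming loudml's HTTP API lists buckets on a GET to the same endpoint, mirroring the CLI's list-buckets):
curl -s localhost:8077/buckets | jq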
Train Model
curl -X POST localhost:8077/models/avg-denied-counts-redo/_train?from=now-30d
[1] 119731
internal server error
[1]+  Done
Check job to see status
curl localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb | jq
{
"duration": 52.941789,
"id": "e7176a7e-3eca-4c6c-b525-6a61f494f7cb",
"model": "avg-denied-counts-redo",
"result": null,
"start_date": "Tue Sep 1 19:04:09 2020",
"start_timestamp": 1598987049.511435,
"state": "waiting",
"type": "training"
}
Stack trace
[2020-09-01 19:04:09,511] ERROR in app: Exception on /models/avg-denied-counts-redo/_train [POST]
Traceback (most recent call last):
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/vendor/lib/python3.5/site-packages/flask_restful/__init__.py", line 269, in error_router
return original_handler(e)
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/opt/vendor/lib/python3.5/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 932, in model_train
job.start(g_config)
File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 1430, in start
kwargs=self.kwargs,
File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/process.py", line 84, in schedule
self._check_pool_state()
File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state
raise RuntimeError('Unexpected error within the Pool')
RuntimeError: Unexpected error within the Pool
ERROR:loudml.server:Exception on /models/avg-denied-counts-redo/_train [POST]
Traceback (most recent call last):
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/vendor/lib/python3.5/site-packages/flask_restful/__init__.py", line 269, in error_router
return original_handler(e)
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/opt/vendor/lib/python3.5/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/vendor/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 932, in model_train
job.start(g_config)
File "/opt/vendor/lib/python3.5/site-packages/loudml/server.py", line 1430, in start
kwargs=self.kwargs,
File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/process.py", line 84, in schedule
self._check_pool_state()
File "/opt/vendor/lib/python3.5/site-packages/pebble/pool/base_pool.py", line 94, in _check_pool_state
raise RuntimeError('Unexpected error within the Pool')
RuntimeError: Unexpected error within the Pool
127.0.0.1 - - [2020-09-01 19:04:09] "POST /models/avg-denied-counts-redo/_train?from=now-30d HTTP/1.1" 500 156 0.007398
Attempt to cancel job
curl -X POST localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb/_cancel
internal server error
curl localhost:8077/jobs/e7176a7e-3eca-4c6c-b525-6a61f494f7cb | jq
{
"duration": 1229.536151,
"id": "e7176a7e-3eca-4c6c-b525-6a61f494f7cb",
"model": "avg-denied-counts-redo",
"result": null,
"start_date": "Tue Sep 1 19:04:09 2020",
"start_timestamp": 1598987049.511435,
"state": "canceling",
"type": "training"
}
No stack trace is generated when the job fails to cancel.
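As noted above, the only thing that clears the stuck jobs for me is a restart; for a plain Docker setup that's just restarting the container (the container name here is a placeholder):
docker restart <loudml-container>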