Error running `wandb init` with `docker-compose`

Open ethanabrooks opened this issue 4 years ago • 10 comments

Thanks for your help with my previous issue. This issue is not too far from the previous one, so hopefully it is just as painless :)

So here is my docker-compose.yml:

version: "3.9"
services:
  wandb:
    image: "wandb/local:0.9.38"
    ports:
      - 8080:8080
    volumes:
      - wandb:/vol
  run:
    environment:
      - WANDB_API_KEY=local-<key>
      - WANDB_BASE_URL=http://wandb:8080
    image: "wandb-issue"
    command: ["wandb", "init"]

volumes:
  wandb:

Here is the Dockerfile for wandb-issue:

FROM nvidia/cuda:11.2.1-cudnn8-devel-ubuntu20.04

RUN apt-get update -q \
 && DEBIAN_FRONTEND="noninteractive" \
    apt-get install -yq \
      python3 \
      python3-pip \
 && apt-get clean

RUN pip3 install wandb
COPY config.yml .

This is the tail of the error:

run_1    | Retry attempt failed:
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/old/retry.py", line 96, in __call__
run_1    |     result = self._call_fn(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 129, in execute
run_1    |     six.reraise(*sys.exc_info())
run_1    |   File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
run_1    |     raise value
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 123, in execute
run_1    |     return self.client.execute(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
run_1    |     result = self._get_result(document, *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
run_1    |     return self.transport.execute(document, *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
run_1    |     request.raise_for_status()
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 943, in raise_for_status
run_1    |     raise HTTPError(http_error_msg, response=self)
run_1    | requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: http://wandb:8080/graphql
run_1    | wandb: Network error (HTTPError), entering retry loop. See wandb/debug-internal.log for full traceback.
run_1    | wandb: Network error resolved after 0:00:08.773061, resuming normal operation.
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/bin/wandb", line 8, in <module>
run_1    |     sys.exit(cli())
run_1    |   File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
run_1    |     return self.main(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
run_1    |     rv = self.invoke(ctx)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1259, in invoke
run_1    |     return _process_result(sub_ctx.command.invoke(sub_ctx))
run_1    |   File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
run_1    |     return ctx.invoke(self.callback, **ctx.params)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
run_1    |     return callback(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/click/decorators.py", line 21, in new_func
run_1    |     return f(get_current_context(), *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/cli/cli.py", line 94, in wrapper
run_1    |     return func(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/cli/cli.py", line 384, in init
run_1    |     project = prompt_for_project(ctx, entity)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/cli/cli.py", line 140, in prompt_for_project
run_1    |     result = whaaaaat.prompt([question])
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/whaaaaat/prompt.py", line 59, in prompt
run_1    |     answer = run_application(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/prompt_toolkit/shortcuts.py", line 576, in run_application
run_1    |     output=create_output(true_color=true_color))
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/prompt_toolkit/shortcuts.py", line 124, in create_output
run_1    |     return Vt100_Output.from_pty(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/prompt_toolkit/terminal/vt100_output.py", line 424, in from_pty
run_1    |     assert stdout.isatty()
run_1    | AssertionError
jax-agents_run_1 exited with code 1

The full error is in this gist: https://gist.github.com/ethanabrooks/c5f7c4c0577e99c8ea8266aee83a2955

ethanabrooks avatar Mar 15 '21 21:03 ethanabrooks

Hey @ethanabrooks, `wandb init` is meant for initializing a directory and requires a terminal that accepts user input. What you likely want is to call `wandb.init(project="test-project")` inside a Python script.
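
A minimal sketch of the condition behind the `assert stdout.isatty()` failure at the end of the traceback: the interactive prompt needs a terminal, and a service started by `docker-compose up` typically has none unless a TTY is allocated. The helper name `has_tty` is hypothetical and not part of wandb.

```python
import sys


def has_tty(stdin=None, stdout=None) -> bool:
    """Return True only if both streams are attached to a terminal.

    `has_tty` is an illustrative helper, not a wandb API. The streams
    default to the process's own stdin/stdout.
    """
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    return bool(stdin.isatty() and stdout.isatty())
```

A script could call `has_tty()` to decide whether prompting is possible at all, and otherwise take a non-interactive path such as `wandb.init(project="test-project")`.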

vanpelt avatar Mar 15 '21 21:03 vanpelt

Also, can you share a little more information about why you're using docker-compose and running a local server instead of using our cloud service? Generally users deploy the local service into their own AWS, Azure, or GCP and configure a database with backups and scalable storage.

vanpelt avatar Mar 15 '21 21:03 vanpelt

Sure. I started running into a lot of rate-limit errors on the cloud service. Apparently since January, you can only run 15 runs concurrently on the site. I expect that running locally would give me more flexibility and control.

ethanabrooks avatar Mar 15 '21 21:03 ethanabrooks

> Hey @ethanabrooks wandb init is meant for initializing a directory and requires a terminal that has user input. What you likely want to be doing is calling wandb.init(project="test-project") inside of a python script.

I get a very similar error with the following docker-compose.yml:

version: "3.9"
services:
  wandb:
    image: "wandb/local"
    ports:
      - 8080:8080
    volumes:
      - wandb:/vol
  run:
    environment:
      - WANDB_API_KEY=local-<key>
      - WANDB_BASE_URL=http://wandb:8080
    image: "wandb-issue"
    command:
      ["python3", "-c", "import wandb; wandb.init(project='test-project')"]

volumes:
  wandb:

run_1    | Retry attempt failed:
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/old/retry.py", line 96, in __call__
run_1    |     result = self._call_fn(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 129, in execute
run_1    |     six.reraise(*sys.exc_info())
run_1    |   File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
run_1    |     raise value
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 123, in execute
run_1    |     return self.client.execute(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
run_1    |     result = self._get_result(document, *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
run_1    |     return self.transport.execute(document, *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
run_1    |     request.raise_for_status()
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 943, in raise_for_status
run_1    |     raise HTTPError(http_error_msg, response=self)
run_1    | requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: http://wandb:8080/graphql

(full error here: https://gist.github.com/ethanabrooks/459f06bc4eb6905b860e7131beac7507)

ethanabrooks avatar Mar 15 '21 21:03 ethanabrooks

What do you see when you go to http://localhost:8080? The 502 error means the wandb service isn't configured properly or failed to boot for some reason.

You likely don't want to be doing this for a number of reasons:

  1. You risk catastrophic data loss as the database and data stores will not be backed up.
  2. You can't share your results with anyone unless you run a server that all users can access.
  3. You won't get the latest updates unless you manually update / maintain the server.

You can definitely run more than 15 jobs concurrently on the cloud offering. We limit you to 50 requests per second and even when you're being limited we always retry and eventually all data will be saved to the interface. If you were seeing rate limit errors in the UI, this is a known issue and we're pushing a fix to cloud next week.
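
On the 502: the first log in this thread shows the network error resolving after about 8 seconds, which would be consistent with the client starting before the server had finished booting. Below is a minimal sketch, using only the standard library, of polling the server before calling `wandb.init`. The function names, attempt count and delay are assumptions for illustration, not part of wandb.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if `url` answers with a non-5xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # A 4xx still means the application is answering;
        # a 5xx such as 502 Bad Gateway means it is not ready.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, name lookup failure, timeout, ...
        return False


def wait_for_server(url, attempts=30, delay=2.0,
                    probe=server_is_up, sleep=time.sleep) -> bool:
    """Poll `url` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

The client script could call `wait_for_server(os.environ["WANDB_BASE_URL"])` first and only proceed to `wandb.init(...)` if it returns True. This only rules out a startup race; it does not help if the server container has exited.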

vanpelt avatar Mar 15 '21 22:03 vanpelt

I see. I think we have ways to deal with some of those issues. Our servers are accessible to outside machines and we can always pull updates from Docker Hub. But it sounds like I am not really using the wandb/local tool the way it was intended.

To answer your question, if I replace http://wandb:8080 with http://localhost:8080, I get:

run_1    | Retry attempt failed:
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 169, in _new_conn
run_1    |     conn = connection.create_connection(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 96, in create_connection
run_1    |     raise err
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 86, in create_connection
run_1    |     sock.connect(sa)
run_1    | ConnectionRefusedError: [Errno 111] Connection refused
run_1    |
run_1    | During handling of the above exception, another exception occurred:
run_1    |
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
run_1    |     httplib_response = self._make_request(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 394, in _make_request
run_1    |     conn.request(method, url, **httplib_request_kw)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 234, in request
run_1    |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 1255, in request
run_1    |     self._send_request(method, url, body, headers, encode_chunked)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
run_1    |     self.endheaders(body, encode_chunked=encode_chunked)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
run_1    |     self._send_output(message_body, encode_chunked=encode_chunked)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
run_1    |     self.send(msg)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 950, in send
run_1    |     self.connect()
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 200, in connect
run_1    |     conn = self._new_conn()
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 181, in _new_conn
run_1    |     raise NewConnectionError(
run_1    | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f99dc8e9130>: Failed to establish a new connection: [Errno 111] Connection refused
run_1    |
run_1    | During handling of the above exception, another exception occurred:
run_1    |
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 439, in send
run_1    |     resp = conn.urlopen(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
run_1    |     retries = retries.increment(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 574, in increment
run_1    |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
run_1    | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f99dc8e9130>: Failed to establish a new connection: [Errno 111] Connection refused'))
run_1    |
run_1    | During handling of the above exception, another exception occurred:
run_1    |
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/old/retry.py", line 96, in __call__
run_1    |     result = self._call_fn(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 123, in execute
run_1    |     return self.client.execute(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
run_1    |     result = self._get_result(document, *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
run_1    |     return self.transport.execute(document, *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 38, in execute
run_1    |     request = requests.post(self.url, **post_args)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 119, in post
run_1    |     return request('post', url, data=data, json=json, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 61, in request
run_1    |     return session.request(method=method, url=url, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 542, in request
run_1    |     resp = self.send(prep, **send_kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 655, in send
run_1    |     r = adapter.send(request, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 516, in send
run_1    |     raise ConnectionError(e, request=request)
run_1    | requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f99dc8e9130>: Failed to establish a new connection: [Errno 111] Connection refused'))
run_1    | wandb: Network error (ConnectionError), entering retry loop. See wandb/debug-internal.log for full traceback.
wandb_1  | *** All services started
wandb_1  | *** Access W&B at http://localhost:8080
run_1    | wandb: W&B API key is configured (use `wandb login --relogin` to force relogin)
run_1    | wandb: Network error (ConnectionError), entering retry loop. See wandb/debug-internal.log for full traceback.

ethanabrooks avatar Mar 15 '21 23:03 ethanabrooks

Sorry, I meant going to http://localhost:8080 in your web browser and seeing if there are any errors displayed on our application page.

vanpelt avatar Mar 16 '21 03:03 vanpelt

Hi Chris. Sorry for the delay in responding. I have been using the main site and it has been working well for me.

> What do you see when you go to http://localhost:8080?

I don't know if this is relevant, but I am running on a server, not on a local machine. I don't access the site with localhost but with a URL that our server makes public.

So, in answer to your question this is what I see: If I delete the wandb volume, run docker-compose up and go to <host>:8080, it prompts me to create a new account and set my password. However, the next time I run docker-compose up (now that the volume exists and has content), the wandb container throws an error and <host>:8080 refuses to connect.

Here is the error, which is similar to but slightly different from the one I posted originally:

WARNING: The Docker Engine you're using is running in swarm mode.

Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use `docker stack deploy`.

Starting wandb-issue_wandb_1 ... done
Starting wandb-issue_run_1   ... done
Attaching to wandb-issue_run_1, wandb-issue_wandb_1
run_1    | Let's setup this directory for W&B!
wandb_1  | *** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
wandb_1  | *** Running /etc/my_init.d/01_enable-services.sh...
wandb_1  | *** Copying services to runit
wandb_1  | mv: cannot stat '/home/wandb/service/*': No such file or directory
wandb_1  | mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
wandb_1  | *** Copying jobber template
wandb_1  | *** Enabling production mode
wandb_1  | ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
wandb_1  | *** /etc/my_init.d/01_enable-services.sh failed with status 1
wandb_1  |
wandb_1  | *** Killing all processes...
wandb-issue_wandb_1 exited with code 1
run_1    | Retry attempt failed:
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 169, in _new_conn
run_1    |     conn = connection.create_connection(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 73, in create_connection
run_1    |     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
run_1    |   File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo
run_1    |     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
run_1    | socket.gaierror: [Errno -2] Name or service not known
run_1    |
run_1    | During handling of the above exception, another exception occurred:
run_1    |
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
run_1    |     httplib_response = self._make_request(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 394, in _make_request
run_1    |     conn.request(method, url, **httplib_request_kw)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 234, in request
run_1    |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 1255, in request
run_1    |     self._send_request(method, url, body, headers, encode_chunked)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
run_1    |     self.endheaders(body, encode_chunked=encode_chunked)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
run_1    |     self._send_output(message_body, encode_chunked=encode_chunked)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
run_1    |     self.send(msg)
run_1    |   File "/usr/lib/python3.8/http/client.py", line 950, in send
run_1    |     self.connect()
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 200, in connect
run_1    |     conn = self._new_conn()
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 181, in _new_conn
run_1    |     raise NewConnectionError(
run_1    | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fa9ee274f70>: Failed to establish a new connection: [Errno -2] Name or service not known
run_1    |
run_1    | During handling of the above exception, another exception occurred:
run_1    |
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 439, in send
run_1    |     resp = conn.urlopen(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
run_1    |     retries = retries.increment(
run_1    |   File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 574, in increment
run_1    |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
run_1    | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='wandb', port=8080): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa9ee274f70>: Failed to establish a new connection: [Errno -2] Name or service not known'))
run_1    |
run_1    | During handling of the above exception, another exception occurred:
run_1    |
run_1    | Traceback (most recent call last):
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/old/retry.py", line 96, in __call__
run_1    |     result = self._call_fn(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 123, in execute
run_1    |     return self.client.execute(*args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
run_1    |     result = self._get_result(document, *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
run_1    |     return self.transport.execute(document, *args, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 38, in execute
run_1    |     request = requests.post(self.url, **post_args)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 119, in post
run_1    |     return request('post', url, data=data, json=json, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 61, in request
run_1    |     return session.request(method=method, url=url, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 542, in request
run_1    |     resp = self.send(prep, **send_kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 655, in send
run_1    |     r = adapter.send(request, **kwargs)
run_1    |   File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 516, in send
run_1    |     raise ConnectionError(e, request=request)
run_1    | requests.exceptions.ConnectionError: HTTPConnectionPool(host='wandb', port=8080): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa9ee274f70>: Failed to establish a new connection: [Errno -2] Name or service not known'))
run_1    | wandb: Network error (ConnectionError), entering retry loop. See wandb/debug-internal.log for full traceback.

ethanabrooks avatar Mar 21 '21 13:03 ethanabrooks

Hey @ethanabrooks, you'll likely need to take the docker-compose stack down entirely and then bring it up again, i.e. `docker-compose down` instead of just `docker-compose stop`. The other error you're seeing occurs because you're trying to connect to http://wandb:8080 from somewhere that `wandb` can't be resolved. That address only works within the Docker virtual network. You'll need to open up your firewalls and configure DNS if you want to expose the service on a different IP / port.
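
One way to see whether the hostname in a base URL resolves from wherever the client is running is a plain name lookup. This is a sketch using only the standard library. `host_resolves` is a hypothetical helper, not a wandb function.

```python
import socket
from urllib.parse import urlsplit


def host_resolves(base_url: str) -> bool:
    """Return True if the hostname in `base_url` can be resolved from this machine."""
    parts = urlsplit(base_url)
    host = parts.hostname
    if host is None:
        # No hostname could be parsed, e.g. the scheme is missing.
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Same failure class as "[Errno -2] Name or service not known" in the log.
        return False
    return True
```

Running `host_resolves("http://wandb:8080")` inside the `run` container versus on the host would show where the name is visible. A lookup may also fail inside the Docker network if the target container is not running, so it is worth checking that the `wandb` service is still up.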

vanpelt avatar Mar 22 '21 22:03 vanpelt

The wandb/local image is the server, while the command provided to it is meant for the client (which also needs a terminal, i.e. a TTY).

bzamecnik avatar Oct 08 '21 08:10 bzamecnik