
How to find out why tracking stopped

Open hcw-00 opened this issue 3 years ago • 4 comments

Hi. I'm trying to run the Aim server as a Docker container, and I'm running my ML code in another container. The two containers are connected via a Docker bridge network. Currently I'm struggling with two problems.

The first is that I cannot log more than two training runs simultaneously. Some of the runs randomly stop tracking and raise errors like this:

Traceback (most recent call last):
  File "train.py", line 59, in main
    aim_logger.log_hyperparams(cfg)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py", line 48, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/appuser/tiamo/tiamo/utils/aim/pytorch_lightning.py", line 97, in log_hyperparams
    self.experiment.set(('hparams', key), value, strict=False)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/base.py", line 42, in experiment
    return get_experiment() or DummyExperiment()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py", line 48, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/base.py", line 40, in get_experiment
    return fn(self)
  File "/home/appuser/tiamo/tiamo/utils/aim/pytorch_lightning.py", line 73, in experiment
    self._run = Run(
  File "/usr/local/lib/python3.8/dist-packages/aim/sdk/run.py", line 437, in __init__
    self.meta_tree: TreeView = self.repo.request_tree(
  File "/usr/local/lib/python3.8/dist-packages/aim/sdk/repo.py", line 312, in request_tree
    return ProxyTree(self._client, name, sub, read_only, from_union)
  File "/usr/local/lib/python3.8/dist-packages/aim/storage/treeviewproxy.py", line 42, in __init__
    handler = self._rpc_client.get_resource_handler('TreeView', args=args)
  File "/usr/local/lib/python3.8/dist-packages/aim/ext/transport/client.py", line 63, in get_resource_handler
    response = self.remote.get_resource(request)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1654246083.032662701","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3093,"referenced_errors":[{"created":"@1654246083.032660970","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

The other is far more critical. Even when logging a single run, tracking gets disconnected partway through training without leaving any errors. I only find out through the Aim UI that tracking has stopped. So my question is: is there any log file I can use to figure out why tracking stopped? Thank you.

hcw-00, Jun 06 '22 14:06

@mihran113 could you please take a look at this? Is this related to the remote server or to gRPC issues?

gorarakelyan, Jun 08 '22 08:06

Hey @hcw-00! The first issue you've described indicates that the client couldn't connect to the server. Possible causes are that the server is down, or that your setup limits the number of connections. For the second one, unfortunately there are no additional log files that can indicate the reason for the failures. Could you provide a bit more information about the setup (Dockerfile, etc.) and an example of a client-side script that fails, so I can reproduce it on my end?
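
One way to tell "server down" apart from other causes is to check, from inside the training container, whether the server's port accepts TCP connections before the logger is created. The sketch below uses only the Python standard library; `parse_aim_url` and `wait_for_server` are hypothetical helper names, not part of Aim, and a successful TCP connection only shows that something is listening on the port, not that the gRPC service behind it is healthy.

```python
import socket
import time
from urllib.parse import urlparse


def parse_aim_url(repo_url):
    """Split an 'aim://host:port' address into (host, port)."""
    parsed = urlparse(repo_url)
    if parsed.scheme != "aim" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected aim://host:port, got {repo_url!r}")
    return parsed.hostname, parsed.port


def wait_for_server(repo_url, attempts=5, delay=2.0, timeout=3.0):
    """Return True once a TCP connection succeeds, False if every attempt fails.

    Hypothetical helper (not an Aim API): it only probes the TCP port.
    """
    host, port = parse_aim_url(repo_url)
    for attempt in range(1, attempts + 1):
        try:
            # Open and immediately close a connection to the server port.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print(f"attempt {attempt}/{attempts}: {host}:{port} unreachable ({exc})")
            if attempt < attempts:
                time.sleep(delay)
    return False
```

Calling `wait_for_server('aim://192.168.144.2:53800')` right before constructing the logger, and aborting if it returns `False`, would at least show whether the `UNAVAILABLE` error coincides with the port being unreachable from the training container.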

mihran113, Jun 08 '22 13:06

@gorarakelyan @mihran113 Thank you! Here is my Dockerfile:

FROM aimstack/aim:latest
ENTRYPOINT [] 

And these are the run commands:

docker run -it --name aim_server -v /home/changwoo/models/aim_repo:/opt/aim aim:0.0.1 /bin/bash
$ aim server
---
docker run -it -v /home/changwoo/models/aim_repo:/opt/aim -p 43800:43800 --network="host" aim:0.0.1 /bin/bash
$ aim up

Network settings:

docker network create aim-net
docker network connect aim-net aim_server
docker network connect aim-net my_ml_container
---
# in my training code
aim_logger = AimLogger(repo='aim://192.168.144.2:53800', ...)

Unfortunately, I'm not allowed to share the code where the failure occurred. But I'm going to try to reproduce it with another script, and if I can, I will share it with you.

hcw-00, Jun 08 '22 15:06

Hey @hcw-00! Just wanted to follow up: have you been able to reproduce it in any other way that can be shared? I've tried this setup and it seems to work fine on my end. I was able to connect with multiple clients and run a fairly long training without any unexpected stops.
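
If the original code cannot be shared, a client-side record of when the server became unreachable might still help narrow things down. The following is a minimal sketch, assuming a background thread in the training process is acceptable; `start_watchdog` is a hypothetical helper, not an Aim feature, and it only logs changes in TCP reachability of the server port to a file.

```python
import logging
import socket
import threading


def start_watchdog(host, port, log_path, interval=30.0, timeout=3.0):
    """Log reachability changes of host:port to log_path from a daemon thread.

    Hypothetical helper (not an Aim API). Returns (stop_event, thread);
    call stop_event.set() to end the watchdog.
    """
    logger = logging.getLogger(f"aim_watchdog:{host}:{port}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep entries out of the root logger
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    stop_event = threading.Event()

    def probe():
        # True if a TCP connection to the server port can be opened.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def loop():
        last_state = None
        while not stop_event.is_set():
            state = probe()
            if state != last_state:  # write only when the state changes
                if state:
                    logger.info("%s:%s is reachable", host, port)
                else:
                    logger.warning("%s:%s is UNREACHABLE", host, port)
                last_state = state
            stop_event.wait(interval)
        logger.removeHandler(handler)
        handler.close()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop_event, thread
```

Started once at the top of the training script, this would leave timestamps that can be compared against the point where the run stops updating in the Aim UI. If the log never shows the port as unreachable, the cause is more likely on the client or server side than in the Docker network.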

mihran113, Aug 04 '22 13:08