
Bayesian hyperparameter search limited to one instance?

Open pkhdipraja opened this issue 2 years ago • 6 comments

Before Asking:

  • [x] I have searched the Issue Tracker.
  • [x] I have searched the Documentation.

What is your question related to?

  • [ ] Comet Python SDK
  • [ ] Comet UI
  • [x] Third Party Integrations (Huggingface, TensorboardX, Pytorch Lightning etc.)

What is your question?

I am trying to run several experiments at once using PyTorch Lightning and Comet (3.31.4). However, when I run multiple scripts at the same time (each logging to a different project) with Bayesian hyperparameter search, one of them always times out while requesting the next set of hyperparameters to try. Is the Bayesian optimization limited to one experiment at a time? Here is a sample of the code that I used:

Code

# comet_ml is imported before the ML framework so its auto-logging hooks can attach
from comet_ml import Optimizer
from pytorch_lightning.loggers import CometLogger

comet_config = {
    "algorithm": "bayes",
    "parameters": {
        "layers": {"type": "integer", "min": 1, "max": 4},
        "clip": {"type": "discrete", "values": [0.5, 1, -1]},
        "lr": {"type": "discrete", "values": [0.00005, 0.00007, 0.0001]},
        "batch_size": {"type": "discrete", "values": [16, 32, 64]},
        "ff_dim": {"type": "discrete", "values": [1024, 2048]},
        "model_dim": {"type": "discrete", "values": [256, 512]},
    },
    "spec": {
        "metric": metric,
        "objective": "maximize",
        "maxCombo": 25,
        "retryAssignLimit": 10,
    },
}

opt = Optimizer(comet_config, api_key="",
                experiment_class="OfflineExperiment",
                auto_output_logging="simple",
                log_git_metadata=False,
                log_git_patch=False)

for experiment in opt.get_experiments():
    layers = experiment.get_parameter("layers")
    clip = experiment.get_parameter("clip")
    lr = experiment.get_parameter("lr")
    batch_size = experiment.get_parameter("batch_size")
    ff_dim = experiment.get_parameter("ff_dim")
    model_dim = experiment.get_parameter("model_dim")

    comet_logger = CometLogger(
        api_key='',
        workspace='',
        project_name='',
        save_dir='./',
        experiment_name='',
        offline=True,

and this is the error that I get:

Traceback (most recent call last):
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1346, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 326, in recv_into
    raise timeout("The read operation timed out")
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 423, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 331, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='www.comet-ml.com', port=443): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main_search.py", line 421, in <module>
    main(cfgs)
  File "main_search.py", line 186, in main
    for experiment in opt.get_experiments():
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/optimizer.py", line 205, in get_experiments
    experiment = self.next(**kwargs)
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/optimizer.py", line 257, in next
    data = self.next_data()
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/optimizer.py", line 284, in next_data
    data = self._api.optimizer_next(self.id)
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/connection.py", line 1522, in optimizer_next
    results = self.get_request("next", params=params)
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/connection.py", line 1451, in get_request
    url, params=params, headers=headers, retry=False
  File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/connection.py", line 402, in get
    stream=stream,
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.comet-ml.com', port=443): Read timed out. (read timeout=10)
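
The last frames of the traceback show `get_request` being called with `retry=False` and a read timeout of 10 seconds, so a single slow response from the optimizer endpoint ends the whole sweep. One possible client-side mitigation is to retry the failing call with exponential backoff. The sketch below is generic and not part of the Comet SDK: `call_with_retry` and its parameters are hypothetical names, and whether repeating a request for the next suggestion is safe for the optimizer's bookkeeping is an assumption.

```python
import time


def call_with_retry(fn, retry_on=(TimeoutError,), attempts=4,
                    base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Hypothetical helper for illustration; not part of comet_ml.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
```

In the setup above this might be used as `call_with_retry(opt.next, retry_on=(requests.exceptions.ReadTimeout,))` in place of iterating over `opt.get_experiments()`, since the traceback shows that `get_experiments` calls `next` internally. Note that `requests.exceptions.ReadTimeout` is not a subclass of the built-in `TimeoutError`, so it has to be passed explicitly.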

What have you tried?

pkhdipraja, Jun 21 '22 14:06

Hello @pkhdipraja. Is it possible to share the rest of the snippet with me? I will try to reproduce the issue on my end. In the meantime, could you try rerunning after setting the following environment variables, and then share the comet.log file with me?

export COMET_LOGGING_FILE=./comet.log
export COMET_LOGGING_FILE_LEVEL=debug
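
If exporting shell variables is inconvenient (for example in a notebook or a job started by a scheduler), the same two settings could be applied from Python. This is only a sketch, and it assumes the SDK reads these variables when it is imported, which is why they are set before any `import comet_ml` line.

```python
import os

# Assumption: these must be set before `import comet_ml` runs,
# so that the SDK picks them up when it configures its logging.
os.environ["COMET_LOGGING_FILE"] = "./comet.log"
os.environ["COMET_LOGGING_FILE_LEVEL"] = "debug"
```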

Also, you mentioned that you're trying to run multiple experiments? What is your training setup like? Is it a single machine with multiple GPUs?

DN6, Jun 28 '22 17:06

I got the same error on a system with only one graphics card and only one experiment running. The error only occurs if the experiment runs for a while: my code runs fine when I train for 10 epochs, but produces this error when I run it for 200 epochs. I have attached my log file here: comet.zip

To me it looks like the error is caused by this log entry: "11901503 COMET ERROR [_online.py:682]: Error sending a notification, make sure you have opted-in for notifications". However, I have opted in for notifications and, as I said before, everything runs fine if I don't train for too long.

YannickNagel, Jul 04 '22 17:07

Bump, I got the same problem after 200 epochs.

pauliusinc, Jul 11 '22 10:07

Hi all. I'm working on reproducing this on my end and will post an update ASAP.

DN6, Jul 12 '22 12:07

Hi @pkhdipraja, I noticed an issue with the code you're using to create Experiments in your sweep. The Lightning CometLogger already creates an Experiment for you, so you don't have to create one with the Optimizer as well. Here's an updated snippet. Could you try running it and see if it helps? (This snippet assumes you have set your credentials as environment variables.)

from comet_ml import Optimizer
from pytorch_lightning.loggers import CometLogger

comet_config = {
    "algorithm": "bayes",
    "parameters": {
        "layers": {"type": "integer", "min": 1, "max": 4},
        "clip": {"type": "discrete", "values": [0.5, 1, -1]},
        "lr": {"type": "discrete", "values": [0.00005, 0.00007, 0.0001]},
        "batch_size": {"type": "discrete", "values": [16, 32, 64]},
        "ff_dim": {"type": "discrete", "values": [1024, 2048]},
        "model_dim": {"type": "discrete", "values": [256, 512]},
    },
    "spec": {
        "metric": "loss",
        "objective": "maximize",
        "maxCombo": 25,
        "retryAssignLimit": 10,
    },
}

opt = Optimizer(
    comet_config,
    auto_output_logging="simple",
    log_git_metadata=False,
    log_git_patch=False,
)

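# get_parameters() yields parameter suggestions only; no Experiment is created here
for suggestion in opt.get_parameters():
    # The Lightning logger creates the Experiment for this run
    comet_logger = CometLogger(offline=True)
    # The sampled hyperparameter values live under the "parameters" key
    parameters = suggestion["parameters"]

    layers = parameters.get("layers")
    clip = parameters.get("clip")
    lr = parameters.get("lr")
    batch_size = parameters.get("batch_size")
    ff_dim = parameters.get("ff_dim")
    model_dim = parameters.get("model_dim")

    # run training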
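
The repeated `parameters.get(...)` calls can also be collapsed into a single typed object before training starts. The sketch below is one way to do that; `HParams` and `from_suggestion` are hypothetical names, not part of comet_ml or PyTorch Lightning, and the only thing taken from the snippet above is that each suggestion carries its values under `suggestion["parameters"]`.

```python
from dataclasses import dataclass, fields


@dataclass
class HParams:
    """Hypothetical container for the six searched hyperparameters."""
    layers: int
    clip: float
    lr: float
    batch_size: int
    ff_dim: int
    model_dim: int

    @classmethod
    def from_suggestion(cls, suggestion):
        params = suggestion["parameters"]
        # Pick out only the declared fields; a missing key raises KeyError
        return cls(**{f.name: params[f.name] for f in fields(cls)})
```

Inside the loop this would be `hparams = HParams.from_suggestion(suggestion)`. Unlike `.get()`, which silently returns `None`, a misspelled or missing parameter name then fails immediately.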

DN6, Jul 15 '22 11:07

@YannickNagel @pauliusinc Could you each open a new ticket for your issue? I tried running a very long experiment (500 epochs) with the Lightning logger and wasn't able to reproduce the error. Could you also include code snippets for what you're trying to do in the new issue? That would be very helpful for debugging.

DN6, Jul 15 '22 11:07