issue-tracking
Bayesian hyperparameter search limited to one instance?
Before Asking:
- [x] I have searched the Issue Tracker.
- [x] I have searched the Documentation.
What is your question related to?
- [ ] Comet Python SDK
- [ ] Comet UI
- [x] Third Party Integrations (Huggingface, TensorboardX, Pytorch Lightning etc.)
What is your question?
I am trying to run several instances of experiments using PyTorch Lightning and Comet (3.31.4). However, when I run multiple scripts at once (each logging to a different project) with Bayesian hyperparameter search, one of them always times out when trying to get the next set of hyperparameters to try. Is the Bayesian optimization limited to one experiment at a time? Here is a sample of the code that I used:
Code
from pytorch_lightning.loggers import CometLogger
from comet_ml import Optimizer
comet_config = {
    "algorithm": "bayes",
    "parameters": {
        "layers": {"type": "integer", "min": 1, "max": 4},
        "clip": {"type": "discrete", "values": [0.5, 1, -1]},
        "lr": {"type": "discrete", "values": [0.00005, 0.00007, 0.0001]},
        "batch_size": {"type": "discrete", "values": [16, 32, 64]},
        "ff_dim": {"type": "discrete", "values": [1024, 2048]},
        "model_dim": {"type": "discrete", "values": [256, 512]},
    },
    "spec": {
        "metric": metric,
        "objective": "maximize",
        "maxCombo": 25,
        "retryAssignLimit": 10,
    },
}

opt = Optimizer(
    comet_config,
    api_key="",
    experiment_class="OfflineExperiment",
    auto_output_logging="simple",
    log_git_metadata=False,
    log_git_patch=False,
)

for experiment in opt.get_experiments():
    layers = experiment.get_parameter("layers")
    clip = experiment.get_parameter("clip")
    lr = experiment.get_parameter("lr")
    batch_size = experiment.get_parameter("batch_size")
    ff_dim = experiment.get_parameter("ff_dim")
    model_dim = experiment.get_parameter("model_dim")
    comet_logger = CometLogger(
        api_key='',
        workspace='',
        project_name='',
        save_dir='./',
        experiment_name='',
        offline=True,
And here is the error that I get:
Traceback (most recent call last):
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1346, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 326, in recv_into
raise timeout("The read operation timed out")
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 423, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 331, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='www.comet-ml.com', port=443): Read timed out. (read timeout=10)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main_search.py", line 421, in <module>
main(cfgs)
File "main_search.py", line 186, in main
for experiment in opt.get_experiments():
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/optimizer.py", line 205, in get_experiments
experiment = self.next(**kwargs)
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/optimizer.py", line 257, in next
data = self.next_data()
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/optimizer.py", line 284, in next_data
data = self._api.optimizer_next(self.id)
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/connection.py", line 1522, in optimizer_next
results = self.get_request("next", params=params)
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/connection.py", line 1451, in get_request
url, params=params, headers=headers, retry=False
File "/home/users/pkahardipraja/.local/lib/python3.6/site-packages/comet_ml/connection.py", line 402, in get
stream=stream,
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.comet-ml.com', port=443): Read timed out. (read timeout=10)
What have you tried?
Hello @pkhdipraja. Is it possible to share the rest of the snippet with me? I will try to reproduce the issue on my end. In the meantime, could you try rerunning after setting the following environment variables, and then share the resulting comet.log file with me?
export COMET_LOGGING_FILE=./comet.log
export COMET_LOGGING_FILE_LEVEL=debug
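(If exporting shell variables is inconvenient, the same variables can also be set from Python; a minimal sketch, using the variable names above, which must run before comet_ml is imported so the SDK picks them up:)

```python
import os

# Set Comet's file-logging variables before comet_ml is imported,
# so the SDK reads them at import time.
os.environ["COMET_LOGGING_FILE"] = "./comet.log"
os.environ["COMET_LOGGING_FILE_LEVEL"] = "debug"

print(os.environ["COMET_LOGGING_FILE"])
```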
Also, you mentioned that you're trying to run multiple experiments? What is your training setup like? Is it a single machine with multiple GPUs?
I got the same error on a system with only one graphics card and only one experiment running. The error only occurs once the experiment has been running for a while: my code runs fine when training for 10 epochs, but produces this error when running for 200 epochs. I have attached my log file here: comet.zip
It looks to me like the error is caused by this log line: 11901503 COMET ERROR [_online.py:682]: Error sending a notification, make sure you have opted-in for notifications
However, I have opted in for notifications, and as I said before, it runs fine if I don't train for too long.
Bump, I got the same problem after 200 epochs.
Hi all. Working on reproducing this on my end. Will post an update ASAP.
Hi @pkhdipraja, I noticed an issue with the code you're using to create Experiments in your sweep. The Lightning CometLogger already creates an Experiment for you, so you don't need to create one through the Optimizer as well. Here's an updated snippet; could you try running it and see if it helps? (This snippet assumes you have set your credentials as environment variables.)
from comet_ml import Optimizer
from pytorch_lightning.loggers import CometLogger
comet_config = {
    "algorithm": "bayes",
    "parameters": {
        "layers": {"type": "integer", "min": 1, "max": 4},
        "clip": {"type": "discrete", "values": [0.5, 1, -1]},
        "lr": {"type": "discrete", "values": [0.00005, 0.00007, 0.0001]},
        "batch_size": {"type": "discrete", "values": [16, 32, 64]},
        "ff_dim": {"type": "discrete", "values": [1024, 2048]},
        "model_dim": {"type": "discrete", "values": [256, 512]},
    },
    "spec": {
        "metric": "loss",
        "objective": "maximize",
        "maxCombo": 25,
        "retryAssignLimit": 10,
    },
}

opt = Optimizer(
    comet_config,
    auto_output_logging="simple",
    log_git_metadata=False,
    log_git_patch=False,
)

for suggestion in opt.get_parameters():
    comet_logger = CometLogger(offline=True)
    parameters = suggestion["parameters"]
    layers = parameters.get("layers")
    clip = parameters.get("clip")
    lr = parameters.get("lr")
    batch_size = parameters.get("batch_size")
    ff_dim = parameters.get("ff_dim")
    model_dim = parameters.get("model_dim")
    # run training
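To make the data flow concrete: each suggestion yielded by get_parameters() carries a "parameters" mapping, as used in the snippet above. Unpacking it into a single hyperparameter dict can be sketched as follows (the suggestion values here are hypothetical, for illustration only):

```python
# Hypothetical suggestion, mirroring the shape consumed in the snippet above:
# a dict whose "parameters" entry maps parameter names to sampled values.
suggestion = {
    "parameters": {
        "layers": 3,
        "clip": 0.5,
        "lr": 0.0001,
        "batch_size": 32,
        "ff_dim": 1024,
        "model_dim": 256,
    },
}

# Collect all sampled values into one dict, e.g. to pass to a LightningModule.
parameters = suggestion["parameters"]
hparams = {
    name: parameters.get(name)
    for name in ("layers", "clip", "lr", "batch_size", "ff_dim", "model_dim")
}
print(hparams)
```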
@YannickNagel @pauliusinc Can you each open a new ticket for your issue? I tried running a very long experiment (500 epochs) with the Lightning logger and wasn't able to reproduce the error. Please also include code snippets for what you're trying to do in the issue; that would be very helpful for debugging.