[Ray Tune, Core] Ray Tune self-terminates at ~100 trials on Windows
What happened + What you expected to happen
Running a Ray Tune experiment with PyTorch and num_samples set to >100 trials results in Ray self-terminating at around 100 trials due to a SIGINT. On resuming the experiment, it will self-terminate again after another ~100 trials. This only happens on Windows; the same script runs to completion on Linux.
Expected behaviour: Ray Tune completes all trials without a SIGINT in between.
Useful information: the error message can vary depending on where it terminates; sometimes it just prints the stack trace at the point where it terminated.
However, sometimes it returns with:
2022-08-08 17:44:41,788 WARNING worker.py:1404 -- The log monitor on node DESKTOP-N18AJ6S failed with the following error:
OSError: [WinError 87] The parameter is incorrect
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
log_monitor.run()
File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
self.open_closed_files()
File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
self.close_all_files()
File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set
Versions / Dependencies
Ray Tune 1.13.0, PyTorch 1.11.0, Windows 10/11
Reproduction script
This can be reproduced with the example Ray Tune PyTorch script listed in the Ray Tune documentation:
from __future__ import annotations
from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from filelock import FileLock
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
def load_data(data_dir="./data"):
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# We add FileLock here because multiple workers will want to
# download data, and this may cause overwrites since
# DataLoader is not threadsafe.
with FileLock(os.path.expanduser("~/.data.lock")):
trainset = torchvision.datasets.CIFAR10(
root=data_dir, train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(
root=data_dir, train=False, download=True, transform=transform)
return trainset, testset
class Net(nn.Module):
def __init__(self, l1=120, l2=84):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(3, 6, 5)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1 = nn.Linear(16 * 5 * 5, l1)
self.fc2 = nn.Linear(l1, l2)
self.fc3 = nn.Linear(l2, 10)
def forward(self, x):
x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
x = x.view(-1, 16 * 5 * 5)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return x
def train_cifar(config, checkpoint_dir=None):
project_dir = os.getenv('TUNE_ORIG_WORKING_DIR')
net = Net(config["l1"], config["l2"])
device = "cpu"
if torch.cuda.is_available():
device = "cuda:0"
if torch.cuda.device_count() > 1:
net = nn.DataParallel(net)
net.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
# The `checkpoint_dir` parameter gets passed by Ray Tune when a checkpoint
# should be restored.
if checkpoint_dir:
checkpoint = os.path.join(checkpoint_dir, "checkpoint")
model_state, optimizer_state = torch.load(checkpoint)
net.load_state_dict(model_state)
optimizer.load_state_dict(optimizer_state)
data_dir = os.path.join(project_dir, "data")
trainset, testset = load_data(data_dir)
test_abs = int(len(trainset) * 0.8)
train_subset, val_subset = random_split(
trainset, [test_abs, len(trainset) - test_abs])
trainloader = torch.utils.data.DataLoader(
train_subset,
batch_size=int(config["batch_size"]),
shuffle=True,
num_workers=8)
valloader = torch.utils.data.DataLoader(
val_subset,
batch_size=int(config["batch_size"]),
shuffle=True,
num_workers=8)
for epoch in range(10): # loop over the dataset multiple times
running_loss = 0.0
epoch_steps = 0
for i, data in enumerate(trainloader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data
inputs, labels = inputs.to(device), labels.to(device)
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# print statistics
running_loss += loss.item()
epoch_steps += 1
if i % 2000 == 1999: # print every 2000 mini-batches
print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
running_loss / epoch_steps))
running_loss = 0.0
# Validation loss
val_loss = 0.0
val_steps = 0
total = 0
correct = 0
for i, data in enumerate(valloader, 0):
with torch.no_grad():
inputs, labels = data
inputs, labels = inputs.to(device), labels.to(device)
outputs = net(inputs)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
loss = criterion(outputs, labels)
val_loss += loss.cpu().numpy()
val_steps += 1
# Here we save a checkpoint. It is automatically registered with
# Ray Tune and will potentially be passed as the `checkpoint_dir`
# parameter in future iterations.
with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
path = os.path.join(checkpoint_dir, "checkpoint")
torch.save(
(net.state_dict(), optimizer.state_dict()), path)
tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
print("Finished Training")
def test_best_model(best_trial):
best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
device = "cuda:0" if torch.cuda.is_available() else "cpu"
best_trained_model.to(device)
checkpoint_path = os.path.join(best_trial.checkpoint.value, "checkpoint")
model_state, optimizer_state = torch.load(checkpoint_path)
best_trained_model.load_state_dict(model_state)
trainset, testset = load_data()
testloader = torch.utils.data.DataLoader(
testset, batch_size=4, shuffle=False, num_workers=2)
correct = 0
total = 0
with torch.no_grad():
for data in testloader:
images, labels = data
images, labels = images.to(device), labels.to(device)
outputs = best_trained_model(images)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
print("Best trial test set accuracy: {}".format(correct / total))
def main(num_samples=10, max_num_epochs=10, gpus_per_trial: int|float=2):
config = {
"l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
"l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
"lr": tune.loguniform(1e-4, 1e-1),
"batch_size": tune.choice([2, 4, 8, 16])
}
scheduler = ASHAScheduler(
max_t=max_num_epochs,
grace_period=1,
reduction_factor=2)
result = tune.run(
tune.with_parameters(train_cifar),
resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
config=config,
metric="loss",
mode="min",
num_samples=num_samples,
scheduler=scheduler,
verbose=2,
local_dir='/e/GRF_outputs2',
name='train_cifar_2022-08-08_14-32-23',
resume=True
)
best_trial = result.get_best_trial("loss", "min", "last")
print("Best trial config: {}".format(best_trial.config))
print("Best trial final validation loss: {}".format(
best_trial.last_result["loss"]))
print("Best trial final validation accuracy: {}".format(
best_trial.last_result["accuracy"]))
if ray.util.client.ray.is_connected():
# If using Ray Client, we want to make sure checkpoint access
# happens on the server. So we wrap `test_best_model` in a Ray task.
# We have to make sure it gets executed on the same node that
# ``tune.run`` is called on.
from ray.util.ml_utils.node import force_on_current_node
remote_fn = force_on_current_node(ray.remote(test_best_model))
ray.get(remote_fn.remote(best_trial))
else:
test_best_model(best_trial)
if __name__ == "__main__":
# You can change the number of GPUs per trial here:
main(num_samples=250, max_num_epochs=2, gpus_per_trial=0.5)
The most stripped-down script I've attempted without losing the machine learning part is:
import logging
import os
import ray
import torch
from ray import tune
from ray.tune import CLIReporter
from torch.utils.data import TensorDataset, DataLoader
from torch import nn
class DummyModel(nn.Module):
def __init__(self, a, output_size, channel_n=9):
super().__init__()
self.output_size=output_size
self.layers = nn.Sequential(nn.Linear(1,a),
nn.ReLU(inplace=True),
nn.Linear(a,output_size))
self.softmax = nn.LogSoftmax(dim=1)
def forward(self, x):
x = x.view(-1, 1)
out = self.layers(x)
return self.softmax(out)
os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"
logger = logging.getLogger(__name__)
def get_dataloader(batch_size=50, num_workers=4):
class1 = [2, 4, 6, 8, 10, 12, 14, 16]
class2 = [3, 6, 9, 12, 15, 18, 21, 24, 27]
class3 = [1, 5, 7, 11, 13, 17, 19]
data_x = torch.tensor(class1 + class2 + class3, dtype=torch.float)
data_y = torch.tensor(len(class1) * [0] + len(class2) * [1] + len(class3) * [2])
dataset = TensorDataset(data_x, data_y)
return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
def train_func(config, epochs=2):
# cwd is changed to the trial folder
batch_size = config["batch_size"]
weight_decay = config['weight_decay']
a = config["a"]
lr = config['lr']
# setup code
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# setup dataloaders
data_loader = get_dataloader(batch_size, num_workers=4)
output_size = 3
channel_n = 9
# setup model
model = DummyModel(a=a, output_size=output_size, channel_n=channel_n)
model = model.to(device)
criterion = torch.nn.NLLLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay, amsgrad=True)
for epoch in range(0, epochs): # loop over the dataset multiple times
log = {}
model.train()
for i, data in enumerate(data_loader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, targets = data
inputs, targets = inputs.to(device), targets.to(device)
optimizer.zero_grad()
# forward + backward
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
log = {"loss": loss.item()}
# log metrics, log in checkpoint in case actor dies half way
tune.report(**log)
def main():
tune.run(
train_func,
num_samples=250,
metric='loss',
mode="min",
resources_per_trial={"cpu": 4, "gpu": 0.15},
local_dir="outputs/tune-dummy",
verbose=2,
config={
"lr": ray.tune.loguniform(0.0001, 0.01),
"weight_decay": ray.tune.loguniform(0.001, 0.01),
"batch_size": ray.tune.choice([128, 256]),
"a": ray.tune.randint(1, 30)
},
progress_reporter=CLIReporter(metric_columns=["training_iteration", "loss"])
)
print("I finished!")
if __name__ == '__main__':
main()
Issue Severity
Medium: It is a significant difficulty but I can work around it.
@mattip is this something you can take a look at?
What version of CUDA are you using?
@mattip CUDA 11.3, but I have tried running this on just the CPU (with 6 concurrent trials) and I still run into the same problem.
I could not reproduce this when using the resource settings below; the second script ran to completion. I have a Windows 10 machine with 64GB RAM.
resources_per_trial={"cpu": 6, "gpu": 0, "memory": 16 * 1024 * 1024 * 1024,},
I created the environment via
>conda create -n issue27646 python=3.9
>conda activate issue27646
>conda install --channel pytorch pytorch=1.11 pip pandas
>pip install "ray==1.13" "ray[tune]"
I tried to run the first script. After running conda install -c pytorch torchvision=0.12, I needed to set resume="AUTO", and then after a while I get:
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "d:\miniconda\envs\issue27646\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
This is after I updated to the latest CUDA 11.7, following this Stack Overflow answer, which has a note that starting from CUDA 11.7 they changed the compilation flags so more memory can be shared between DLLs. I used the script in that answer to modify the DLLs in the directory mentioned above and reran the script. This time it started running.
After about 4000 seconds, the run stopped with a message WARNING tune.py:682 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
I didn't touch the machine, so something caused a failure. In raylet.out I see:
[2022-08-11 19:24:20,078 I 14084 8692] (raylet.exe) main.cc:291: Raylet received SIGTERM, shutting down...
[2022-08-11 19:24:20,078 I 14084 8692] (raylet.exe) accessor.cc:430: Unregistering node info, node id = 0157f63c3e0cc3b6d2b4cc1fd4e190963b3007c468668082a0396e62
[2022-08-11 19:24:20,086 I 14084 8692] (raylet.exe) io_service_pool.cc:47: IOServicePool is stopped.
[2022-08-11 19:24:20,138 W 14084 13392] (raylet.exe) agent_manager.cc:104: Agent process with pid 7232 exit, return value 21. ip 127.0.0.1. pid 7232
[2022-08-11 19:24:20,138 E 14084 13392] (raylet.exe) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
but I don't see any errors in the dashboard_agent.log.
Last few hundred lines of console output:
== Status ==
Current time: 2022-08-11 19:22:44 (running for 01:07:27.97)
Memory usage on this node: 15.4/63.9 GiB
Using AsyncHyperBand: num_stopped=94
Bracket: Iter 2.000: -1.480599593940377 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 122/250 (26 PENDING, 2 RUNNING, 94 TERMINATED)
Trial train_cifar_67d45_00093 reported loss=1.65 with parameters={'l1': 4, 'l2': 32, 'lr': 0.0008004035146948047, 'batch_size': 2}. This trial completed.
(pid=) 2022-08-11 19:22:48,652 INFO context.py:67 -- Exec'ing worker with command: "d:\miniconda\envs\issue27646\python.exe" d:\miniconda\envs\issue27646\lib\site-packages\ray\workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63578 --object-store-name=tcp://127.0.0.1:65094 --raylet-name=tcp://127.0.0.1:64190 --redis-address=None --storage=None --temp-dir=C:\Users\matti\AppData\Local\Temp\ray --metrics-agent-port=65337 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65286 --redis-password=5241590000000000 --startup-token=117 --runtime-env-hash=-2001109766
(train_cifar pid=11188) [1, 2000] loss: 2.030
(train_cifar pid=11548) Files already downloaded and verified
2022-08-11 19:22:51,667 WARNING tune.py:682 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
(train_cifar pid=11548) Files already downloaded and verified
(train_cifar pid=11188) [1, 4000] loss: 0.853
== Status ==
Current time: 2022-08-11 19:22:53 (running for 01:07:37.01)
Memory usage on this node: 12.2/63.9 GiB
Using AsyncHyperBand: num_stopped=95
Bracket: Iter 2.000: -1.4806991674900054 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (26 PENDING, 2 RUNNING, 95 TERMINATED)
== Status ==
Current time: 2022-08-11 19:22:53 (running for 01:07:37.02)
Memory usage on this node: 12.2/63.9 GiB
Using AsyncHyperBand: num_stopped=95
Bracket: Iter 2.000: -1.4806991674900054 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (26 PENDING, 2 RUNNING, 95 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy |
|-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------|
| train_cifar_67d45_00095 | RUNNING | 127.0.0.1:11188 | 4 | 16 | 256 | 0.00248724 | | | | |
| train_cifar_67d45_00096 | RUNNING | 127.0.0.1:11548 | 2 | 128 | 16 | 0.00299209 | | | | |
| train_cifar_67d45_00097 | PENDING | | 8 | 8 | 32 | 0.0011024 | | | | |
| train_cifar_67d45_00098 | PENDING | | 16 | 8 | 32 | 0.000731722 | | | | |
| train_cifar_67d45_00099 | PENDING | | 4 | 4 | 8 | 0.00119135 | | | | |
| train_cifar_67d45_00100 | PENDING | | 16 | 256 | 256 | 0.0024511 | | | | |
| train_cifar_67d45_00101 | PENDING | | 16 | 64 | 16 | 0.000200764 | | | | |
| train_cifar_67d45_00102 | PENDING | | 4 | 128 | 64 | 0.0108665 | | | | |
| train_cifar_67d45_00103 | PENDING | | 8 | 32 | 4 | 0.00048593 | | | | |
| train_cifar_67d45_00104 | PENDING | | 2 | 16 | 32 | 0.00235333 | | | | |
| train_cifar_67d45_00105 | PENDING | | 4 | 256 | 16 | 0.00226565 | | | | |
| train_cifar_67d45_00106 | PENDING | | 4 | 32 | 16 | 0.0681302 | | | | |
| train_cifar_67d45_00107 | PENDING | | 2 | 16 | 16 | 0.0211689 | | | | |
| train_cifar_67d45_00108 | PENDING | | 4 | 4 | 256 | 0.000285824 | | | | |
| train_cifar_67d45_00109 | PENDING | | 8 | 16 | 256 | 0.0564388 | | | | |
| train_cifar_67d45_00110 | PENDING | | 2 | 32 | 4 | 0.00156365 | | | | |
| train_cifar_67d45_00111 | PENDING | | 16 | 64 | 4 | 0.0973652 | | | | |
| train_cifar_67d45_00112 | PENDING | | 16 | 8 | 8 | 0.00774178 | | | | |
| train_cifar_67d45_00113 | PENDING | | 8 | 256 | 64 | 0.00879403 | | | | |
| train_cifar_67d45_00114 | PENDING | | 2 | 128 | 256 | 0.000352467 | | | | |
| train_cifar_67d45_00115 | PENDING | | 16 | 32 | 4 | 0.000714886 | | | | |
| train_cifar_67d45_00116 | PENDING | | 8 | 8 | 64 | 0.00794146 | | | | |
| train_cifar_67d45_00117 | PENDING | | 8 | 8 | 4 | 0.000130055 | | | | |
| train_cifar_67d45_00118 | PENDING | | 4 | 4 | 64 | 0.000361327 | | | | |
| train_cifar_67d45_00119 | PENDING | | 2 | 16 | 4 | 0.0193271 | | | | |
| train_cifar_67d45_00120 | PENDING | | 4 | 64 | 16 | 0.00325031 | | | | |
| train_cifar_67d45_00121 | PENDING | | 16 | 256 | 64 | 0.0396648 | | | | |
| train_cifar_67d45_00122 | PENDING | | 16 | 32 | 16 | 0.000303553 | | | | |
| train_cifar_67d45_00000 | TERMINATED | 127.0.0.1:336 | 16 | 16 | 4 | 0.0124023 | 2 | 77.2554 | 1.59222 | 0.4155 |
| train_cifar_67d45_00001 | TERMINATED | 127.0.0.1:11812 | 4 | 16 | 64 | 0.00403809 | 2 | 110.609 | 1.52443 | 0.4596 |
| train_cifar_67d45_00002 | TERMINATED | 127.0.0.1:13492 | 8 | 4 | 4 | 0.0913862 | 1 | 45.6294 | 2.31545 | 0.0994 |
| train_cifar_67d45_00003 | TERMINATED | 127.0.0.1:4608 | 16 | 128 | 8 | 0.0640589 | 1 | 40.6615 | 2.31283 | 0.0942 |
| train_cifar_67d45_00004 | TERMINATED | 127.0.0.1:3704 | 2 | 8 | 256 | 0.000150398 | 2 | 151.123 | 1.4805 | 0.4549 |
| train_cifar_67d45_00005 | TERMINATED | 127.0.0.1:4064 | 16 | 8 | 4 | 0.00111178 | 1 | 41.0484 | 2.17909 | 0.198 |
| train_cifar_67d45_00006 | TERMINATED | 127.0.0.1:3968 | 16 | 32 | 256 | 0.00632639 | 2 | 77.9044 | 1.30005 | 0.534 |
| train_cifar_67d45_00007 | TERMINATED | 127.0.0.1:10868 | 8 | 4 | 16 | 0.0250599 | 1 | 46.0688 | 2.31255 | 0.1043 |
| train_cifar_67d45_00008 | TERMINATED | 127.0.0.1:11900 | 4 | 128 | 256 | 0.00094432 | 2 | 111.369 | 1.2784 | 0.5435 |
| train_cifar_67d45_00009 | TERMINATED | 127.0.0.1:2348 | 2 | 64 | 128 | 0.000111395 | 1 | 78.9413 | 1.94607 | 0.2959 |
| train_cifar_67d45_00010 | TERMINATED | 127.0.0.1:10284 | 2 | 256 | 4 | 0.00148054 | 1 | 79.7107 | 2.30502 | 0.1024 |
| train_cifar_67d45_00011 | TERMINATED | 127.0.0.1:9696 | 8 | 64 | 8 | 0.000299093 | 1 | 46.096 | 2.23853 | 0.1884 |
| train_cifar_67d45_00012 | TERMINATED | 127.0.0.1:12260 | 16 | 8 | 4 | 0.0101674 | 2 | 77.8852 | 1.59929 | 0.4011 |
| train_cifar_67d45_00013 | TERMINATED | 127.0.0.1:8476 | 8 | 32 | 16 | 0.0124174 | 2 | 87.9289 | 1.71918 | 0.3633 |
| train_cifar_67d45_00014 | TERMINATED | 127.0.0.1:1516 | 8 | 16 | 32 | 0.0323596 | 1 | 45.8615 | 2.31084 | 0.0981 |
| train_cifar_67d45_00015 | TERMINATED | 127.0.0.1:9808 | 16 | 128 | 4 | 0.000842029 | 2 | 77.9852 | 1.77241 | 0.3257 |
| train_cifar_67d45_00016 | TERMINATED | 127.0.0.1:13544 | 8 | 64 | 8 | 0.000126637 | 1 | 46.2296 | 2.30141 | 0.1111 |
| train_cifar_67d45_00017 | TERMINATED | 127.0.0.1:9628 | 2 | 128 | 128 | 0.000865456 | 2 | 157.568 | 1.34067 | 0.518 |
| train_cifar_67d45_00018 | TERMINATED | 127.0.0.1:10988 | 8 | 4 | 4 | 0.000133137 | 1 | 46.7045 | 2.31236 | 0.0943 |
| train_cifar_67d45_00019 | TERMINATED | 127.0.0.1:11440 | 2 | 64 | 256 | 0.000839588 | 2 | 157.86 | 1.34019 | 0.5154 |
| train_cifar_67d45_00020 | TERMINATED | 127.0.0.1:7180 | 4 | 8 | 64 | 0.00176578 | 2 | 112.179 | 1.35746 | 0.5088 |
| train_cifar_67d45_00021 | TERMINATED | 127.0.0.1:2072 | 8 | 32 | 64 | 0.0359645 | 1 | 45.7077 | 2.3174 | 0.0997 |
| train_cifar_67d45_00022 | TERMINATED | 127.0.0.1:6648 | 16 | 32 | 4 | 0.0691986 | 1 | 40.5521 | 2.31416 | 0.1017 |
| train_cifar_67d45_00023 | TERMINATED | 127.0.0.1:12884 | 16 | 16 | 16 | 0.0348501 | 1 | 40.1453 | 1.98061 | 0.2451 |
| train_cifar_67d45_00024 | TERMINATED | 127.0.0.1:3928 | 16 | 4 | 8 | 0.00240846 | 1 | 40.6082 | 2.0129 | 0.2385 |
| train_cifar_67d45_00025 | TERMINATED | 127.0.0.1:14028 | 2 | 16 | 256 | 0.0916574 | 1 | 77.8128 | 2.40566 | 0.0987 |
| train_cifar_67d45_00026 | TERMINATED | 127.0.0.1:1308 | 8 | 16 | 64 | 0.000723566 | 2 | 86.6233 | 1.4807 | 0.4534 |
| train_cifar_67d45_00027 | TERMINATED | 127.0.0.1:3740 | 16 | 16 | 64 | 0.000106305 | 1 | 40.1516 | 2.30268 | 0.1006 |
| train_cifar_67d45_00028 | TERMINATED | 127.0.0.1:4216 | 16 | 64 | 32 | 0.000128329 | 1 | 40.4751 | 2.30131 | 0.1052 |
| train_cifar_67d45_00029 | TERMINATED | 127.0.0.1:12724 | 2 | 32 | 32 | 0.0868636 | 1 | 76.6885 | 2.43158 | 0.1006 |
| train_cifar_67d45_00030 | TERMINATED | 127.0.0.1:6016 | 4 | 16 | 128 | 0.000196543 | 2 | 111.905 | 1.57292 | 0.4263 |
| train_cifar_67d45_00031 | TERMINATED | 127.0.0.1:2380 | 16 | 4 | 16 | 0.00353314 | 2 | 76.4946 | 1.47283 | 0.4526 |
| train_cifar_67d45_00032 | TERMINATED | 127.0.0.1:6264 | 4 | 64 | 32 | 0.00470306 | 2 | 110.682 | 1.61346 | 0.4188 |
| train_cifar_67d45_00033 | TERMINATED | 127.0.0.1:2660 | 4 | 32 | 16 | 0.000171846 | 1 | 56.9745 | 2.061 | 0.2028 |
| train_cifar_67d45_00034 | TERMINATED | 127.0.0.1:11748 | 8 | 4 | 4 | 0.0356921 | 1 | 46.5302 | 2.31146 | 0.1007 |
| train_cifar_67d45_00035 | TERMINATED | 127.0.0.1:13868 | 8 | 256 | 256 | 0.00027364 | 1 | 46.8195 | 2.02851 | 0.2424 |
| train_cifar_67d45_00036 | TERMINATED | 127.0.0.1:4852 | 2 | 8 | 128 | 0.0272566 | 1 | 78.8351 | 2.34708 | 0.099 |
| train_cifar_67d45_00037 | TERMINATED | 127.0.0.1:3896 | 4 | 8 | 64 | 0.0293812 | 1 | 58.2364 | 2.31463 | 0.0978 |
| train_cifar_67d45_00038 | TERMINATED | 127.0.0.1:13252 | 2 | 4 | 64 | 0.000994105 | 2 | 149.891 | 1.47705 | 0.4587 |
| train_cifar_67d45_00039 | TERMINATED | 127.0.0.1:11456 | 4 | 128 | 32 | 0.00078335 | 2 | 112.726 | 1.35677 | 0.5105 |
| train_cifar_67d45_00040 | TERMINATED | 127.0.0.1:10568 | 16 | 8 | 4 | 0.00592756 | 2 | 78.1619 | 1.8449 | 0.2313 |
| train_cifar_67d45_00041 | TERMINATED | 127.0.0.1:12360 | 4 | 64 | 128 | 0.000160305 | 1 | 55.9045 | 2.03543 | 0.234 |
| train_cifar_67d45_00042 | TERMINATED | 127.0.0.1:13544 | 4 | 256 | 128 | 0.00492324 | 2 | 108.62 | 1.55762 | 0.4429 |
| train_cifar_67d45_00043 | TERMINATED | 127.0.0.1:11532 | 8 | 32 | 8 | 0.000333664 | 1 | 46.5984 | 2.27785 | 0.1753 |
| train_cifar_67d45_00044 | TERMINATED | 127.0.0.1:2348 | 2 | 32 | 128 | 0.000241014 | 2 | 155 | 1.43656 | 0.4784 |
| train_cifar_67d45_00045 | TERMINATED | 127.0.0.1:4280 | 16 | 8 | 8 | 0.000284191 | 1 | 40.9402 | 2.28762 | 0.1432 |
| train_cifar_67d45_00046 | TERMINATED | 127.0.0.1:2952 | 8 | 64 | 128 | 0.00545612 | 2 | 89.7666 | 1.37725 | 0.5139 |
| train_cifar_67d45_00047 | TERMINATED | 127.0.0.1:13204 | 2 | 16 | 8 | 0.00372381 | 2 | 149.886 | 1.96147 | 0.2749 |
| train_cifar_67d45_00048 | TERMINATED | 127.0.0.1:9816 | 16 | 4 | 4 | 0.0129934 | 2 | 76.3272 | 1.70723 | 0.3373 |
| train_cifar_67d45_00049 | TERMINATED | 127.0.0.1:11536 | 16 | 4 | 128 | 0.000137002 | 1 | 40.6608 | 2.23616 | 0.1501 |
| train_cifar_67d45_00050 | TERMINATED | 127.0.0.1:12232 | 2 | 256 | 4 | 0.000396095 | 2 | 151.73 | 1.53373 | 0.4467 |
| train_cifar_67d45_00051 | TERMINATED | 127.0.0.1:12620 | 4 | 16 | 4 | 0.0019273 | 2 | 113.352 | 1.60781 | 0.3987 |
| train_cifar_67d45_00052 | TERMINATED | 127.0.0.1:12420 | 2 | 8 | 32 | 0.00083404 | 2 | 152.35 | 1.42813 | 0.479 |
| train_cifar_67d45_00053 | TERMINATED | 127.0.0.1:8036 | 4 | 64 | 32 | 0.0167031 | 1 | 58.3754 | 2.15813 | 0.1692 |
| train_cifar_67d45_00054 | TERMINATED | 127.0.0.1:4376 | 2 | 128 | 16 | 0.0185858 | 1 | 79.929 | 2.3207 | 0.1014 |
| train_cifar_67d45_00055 | TERMINATED | 127.0.0.1:1224 | 8 | 64 | 128 | 0.00030277 | 1 | 47.0069 | 2.07092 | 0.2412 |
| train_cifar_67d45_00056 | TERMINATED | 127.0.0.1:14256 | 8 | 16 | 4 | 0.00138638 | 1 | 45.2928 | 2.29623 | 0.1288 |
| train_cifar_67d45_00057 | TERMINATED | 127.0.0.1:13172 | 4 | 256 | 128 | 0.041717 | 1 | 56.7852 | 2.33236 | 0.1046 |
| train_cifar_67d45_00058 | TERMINATED | 127.0.0.1:864 | 2 | 16 | 256 | 0.000721058 | 2 | 153.45 | 1.44545 | 0.4873 |
| train_cifar_67d45_00059 | TERMINATED | 127.0.0.1:13868 | 16 | 128 | 16 | 0.000218716 | 1 | 40.154 | 2.29752 | 0.1432 |
| train_cifar_67d45_00060 | TERMINATED | 127.0.0.1:10920 | 2 | 8 | 4 | 0.0107507 | 1 | 79.4329 | 2.3216 | 0.097 |
| train_cifar_67d45_00061 | TERMINATED | 127.0.0.1:5440 | 16 | 32 | 64 | 0.000724089 | 1 | 40.6402 | 2.06584 | 0.2631 |
| train_cifar_67d45_00062 | TERMINATED | 127.0.0.1:4336 | 2 | 8 | 16 | 0.000276858 | 2 | 153.573 | 1.42802 | 0.473 |
| train_cifar_67d45_00063 | TERMINATED | 127.0.0.1:10780 | 4 | 64 | 64 | 0.000333598 | 2 | 111.776 | 1.46765 | 0.4614 |
| train_cifar_67d45_00064 | TERMINATED | 127.0.0.1:12644 | 8 | 4 | 4 | 0.0181411 | 1 | 46.1321 | 2.31083 | 0.0981 |
| train_cifar_67d45_00065 | TERMINATED | 127.0.0.1:12428 | 4 | 256 | 128 | 0.0330339 | 1 | 57.841 | 2.26378 | 0.1518 |
| train_cifar_67d45_00066 | TERMINATED | 127.0.0.1:3968 | 8 | 128 | 16 | 0.0317081 | 1 | 45.5992 | 2.30786 | 0.1031 |
| train_cifar_67d45_00067 | TERMINATED | 127.0.0.1:8800 | 8 | 64 | 64 | 0.0691665 | 1 | 46.4963 | 2.31964 | 0.0967 |
| train_cifar_67d45_00068 | TERMINATED | 127.0.0.1:12428 | 2 | 128 | 128 | 0.0197163 | 1 | 78.1825 | 2.31514 | 0.1037 |
| train_cifar_67d45_00069 | TERMINATED | 127.0.0.1:13852 | 16 | 128 | 64 | 0.0161628 | 2 | 78.7702 | 1.54117 | 0.4443 |
| train_cifar_67d45_00070 | TERMINATED | 127.0.0.1:3556 | 8 | 8 | 8 | 0.000105775 | 1 | 46.147 | 2.30702 | 0.1009 |
| train_cifar_67d45_00071 | TERMINATED | 127.0.0.1:10060 | 4 | 8 | 4 | 0.000667863 | 2 | 112.191 | 1.62251 | 0.3845 |
| train_cifar_67d45_00072 | TERMINATED | 127.0.0.1:3612 | 4 | 32 | 16 | 0.00153857 | 2 | 109.401 | 1.45657 | 0.4853 |
| train_cifar_67d45_00073 | TERMINATED | 127.0.0.1:7888 | 4 | 64 | 4 | 0.00949178 | 1 | 58.5782 | 2.30817 | 0.1012 |
| train_cifar_67d45_00074 | TERMINATED | 127.0.0.1:8660 | 2 | 32 | 256 | 0.0674951 | 1 | 78.0406 | 2.33438 | 0.0992 |
| train_cifar_67d45_00075 | TERMINATED | 127.0.0.1:13436 | 8 | 16 | 256 | 0.0783057 | 1 | 45.6636 | 2.31096 | 0.1001 |
| train_cifar_67d45_00076 | TERMINATED | 127.0.0.1:7036 | 2 | 128 | 16 | 0.00196368 | 2 | 151.737 | 1.46339 | 0.4774 |
| train_cifar_67d45_00077 | TERMINATED | 127.0.0.1:12680 | 4 | 8 | 16 | 0.00249843 | 2 | 113.938 | 1.45827 | 0.4617 |
| train_cifar_67d45_00078 | TERMINATED | 127.0.0.1:14140 | 2 | 16 | 32 | 0.018761 | 1 | 77.8859 | 2.32846 | 0.0983 |
| train_cifar_67d45_00079 | TERMINATED | 127.0.0.1:2348 | 2 | 8 | 256 | 0.00177873 | 2 | 152.032 | 1.5823 | 0.4289 |
| train_cifar_67d45_00080 | TERMINATED | 127.0.0.1:2908 | 4 | 8 | 32 | 0.000288898 | 2 | 109.562 | 1.62298 | 0.4019 |
| train_cifar_67d45_00081 | TERMINATED | 127.0.0.1:7036 | 2 | 8 | 4 | 0.00354141 | 2 | 152.99 | 1.9352 | 0.2455 |
| train_cifar_67d45_00082 | TERMINATED | 127.0.0.1:10336 | 16 | 8 | 128 | 0.00496885 | 2 | 78.8535 | 1.43197 | 0.4919 |
| train_cifar_67d45_00083 | TERMINATED | 127.0.0.1:6172 | 2 | 32 | 64 | 0.00194701 | 2 | 152.252 | 1.56962 | 0.435 |
| train_cifar_67d45_00084 | TERMINATED | 127.0.0.1:13336 | 16 | 16 | 128 | 0.00256253 | 2 | 77.6169 | 1.3785 | 0.5008 |
| train_cifar_67d45_00085 | TERMINATED | 127.0.0.1:3448 | 4 | 16 | 256 | 0.0318124 | 1 | 55.7027 | 2.32376 | 0.0941 |
| train_cifar_67d45_00086 | TERMINATED | 127.0.0.1:3880 | 8 | 4 | 128 | 0.000121855 | 1 | 47.7121 | 2.26793 | 0.151 |
| train_cifar_67d45_00087 | TERMINATED | 127.0.0.1:13464 | 2 | 16 | 16 | 0.0307163 | 1 | 78.0027 | 2.36501 | 0.0994 |
| train_cifar_67d45_00088 | TERMINATED | 127.0.0.1:1536 | 16 | 64 | 64 | 0.000480643 | 1 | 40.9625 | 2.18846 | 0.201 |
| train_cifar_67d45_00089 | TERMINATED | 127.0.0.1:6076 | 4 | 8 | 64 | 0.0322508 | 1 | 56.9575 | 2.31536 | 0.1034 |
| train_cifar_67d45_00090 | TERMINATED | 127.0.0.1:13940 | 4 | 64 | 8 | 0.0027166 | 2 | 110.205 | 1.49607 | 0.4566 |
| train_cifar_67d45_00091 | TERMINATED | 127.0.0.1:13320 | 8 | 128 | 64 | 0.0377294 | 1 | 45.8307 | 2.31132 | 0.1025 |
| train_cifar_67d45_00092 | TERMINATED | 127.0.0.1:14276 | 4 | 8 | 128 | 0.000328398 | 2 | 112.965 | 1.45642 | 0.4654 |
| train_cifar_67d45_00093 | TERMINATED | 127.0.0.1:1700 | 2 | 4 | 32 | 0.000800404 | 2 | 151.438 | 1.64851 | 0.3835 |
| train_cifar_67d45_00094 | TERMINATED | 127.0.0.1:14324 | 8 | 256 | 8 | 0.000253253 | 1 | 45.7668 | 2.1415 | 0.2159 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+
(train_cifar pid=11188) [1, 6000] loss: 0.536
(train_cifar pid=11188) [1, 8000] loss: 0.392
(train_cifar pid=11188) [1, 10000] loss: 0.305
(train_cifar pid=11548) [1, 2000] loss: 2.160
(train_cifar pid=11548) [1, 4000] loss: 0.971
(train_cifar pid=11548) [1, 6000] loss: 0.621
(train_cifar pid=11548) [1, 8000] loss: 0.449
(train_cifar pid=11548) [1, 10000] loss: 0.357
(train_cifar pid=11548) [1, 12000] loss: 0.299
(train_cifar pid=11548) [1, 14000] loss: 0.248
(train_cifar pid=11548) [1, 16000] loss: 0.217
(train_cifar pid=11548) [1, 18000] loss: 0.190
(train_cifar pid=11548) [1, 20000] loss: 0.173
2022-08-11 19:24:07,997 ERROR tune.py:743 -- Trials did not complete: [train_cifar_67d45_00095, train_cifar_67d45_00096, train_cifar_67d45_00097, train_cifar_67d45_00098, train_cifar_67d45_00099, train_cifar_67d45_00100, train_cifar_67d45_00101, train_cifar_67d45_00102, train_cifar_67d45_00103, train_cifar_67d45_00104, train_cifar_67d45_00105, train_cifar_67d45_00106, train_cifar_67d45_00107, train_cifar_67d45_00108, train_cifar_67d45_00109, train_cifar_67d45_00110, train_cifar_67d45_00111, train_cifar_67d45_00112, train_cifar_67d45_00113, train_cifar_67d45_00114, train_cifar_67d45_00115, train_cifar_67d45_00116, train_cifar_67d45_00117, train_cifar_67d45_00118, train_cifar_67d45_00119, train_cifar_67d45_00120, train_cifar_67d45_00121, train_cifar_67d45_00122]
2022-08-11 19:24:07,997 INFO tune.py:747 -- Total run time: 4131.36 seconds (4057.01 seconds for the tuning loop).
2022-08-11 19:24:07,997 WARNING tune.py:753 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`
(pid=) 2022-08-11 19:24:08,159 INFO context.py:67 -- Exec'ing worker with command: "d:\miniconda\envs\issue27646\python.exe" d:\miniconda\envs\issue27646\lib\site-packages\ray\workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63578 --object-store-name=tcp://127.0.0.1:65094 --raylet-name=tcp://127.0.0.1:64190 --redis-address=None --storage=None --temp-dir=C:\Users\matti\AppData\Local\Temp\ray --metrics-agent-port=65337 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65286 --redis-password=5241590000000000 --startup-token=119 --runtime-env-hash=-2001109766
(pid=) 2022-08-11 19:24:08,335 INFO context.py:67 -- Exec'ing worker with command: "d:\miniconda\envs\issue27646\python.exe" d:\miniconda\envs\issue27646\lib\site-packages\ray\workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63578 --object-store-name=tcp://127.0.0.1:65094 --raylet-name=tcp://127.0.0.1:64190 --redis-address=None --storage=None --temp-dir=C:\Users\matti\AppData\Local\Temp\ray --metrics-agent-port=65337 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65286 --redis-password=5241590000000000 --startup-token=120 --runtime-env-hash=-2001109766
Best trial config: {'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Best trial final validation loss: 1.2783956696420908
Best trial final validation accuracy: 0.5435
Files already downloaded and verified
Files already downloaded and verified
Best trial test set accuracy: 0.5406
@richardliaw any thoughts?
Using ray==2.0.0rc0 still crashes after ~124 trials. The information is saved, so restarting continues where it left off.
When I tested it, it crashed again after another ~100 trials following the resume.
Yes, it exits unexpectedly for me too after it saves the data and prints out a summary.
So I see this in the log that you posted, @mattip:
2022-08-11 19:22:51,667 WARNING tune.py:682 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
Right. The question is why? What could be sending a stop signal? I scanned the logs, and all I could find was
[2022-08-11 19:24:20,138 E 14084 13392] (raylet.exe) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See "dashboard_agent.log" for the root cause.
but I could not find a hint in the dashboard_agent.log. Something else must have sent a signal to the raylet.exe process or killed the agent.
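One crude way to probe this (a sketch, not something Ray provides; note that Ray Tune has its own SIGINT handling once tune.run is active, so this may need adapting): install a SIGINT handler in the driver script before calling tune.run, so that if the interrupt really is delivered to the driver process we at least get a stack trace showing when it arrived.
import signal
import traceback

def log_sigint(signum, frame):
    # Record where the driver was when the interrupt arrived, then defer to
    # the default handler so the normal KeyboardInterrupt behaviour is kept.
    print(f"SIGINT ({signum}) received; driver stack at that moment:")
    traceback.print_stack(frame)
    signal.default_int_handler(signum, frame)

signal.signal(signal.SIGINT, log_sigint)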
@stephano41 when the process is halted, what is the last status report? For me it halts after about 1hr7min
== Status ==
Current time: 2022-08-11 19:22:53 (running for 01:07:37.02)
Memory usage on this node: 12.2/63.9 GiB
Using AsyncHyperBand: num_stopped=95
Bracket: Iter 2.000: -1.4806991674900054 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (26 PENDING, 2 RUNNING, 95 TERMINATED)
== Status ==
Current time: 2022-08-12 15:18:57 (running for 01:06:17.48)
Memory usage on this node: 12.9/63.9 GiB
Using AsyncHyperBand: num_stopped=96
Bracket: Iter 2.000: -1.4813302184641362 | Iter 1.000: -1.8759344378352165
Resources requested: 20.0/24 CPUs, 1.0/1 GPUs, 0.0/34.76 GiB heap, 0.0/17.38 GiB objects
Current best trial: adbad_00024 with loss=1.3153812784522771 and parameters={'l1': 256, 'l2': 8, 'lr': 0.0009968174381239548, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (25 PENDING, 2 RUNNING, 96 TERMINATED)
(train_cifar pid=8768) Files already downloaded and verified
2022-08-12 15:18:58,410 WARNING tune.py:686 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
I see an unusual message from the runtime_env module at the end of the dashboard_agent.log at the time the process is shutting down. It seems the TUNE_ORIG_WORKING_DIR is used in both the script (for the data directory) and the RayTrialExecutor in tune/execution/ray_trial_executor.py:
2022-08-12 16:48:35,778 INFO runtime_env_agent.py:127 -- Unused runtime env {"env_vars": {"TUNE_ORIG_WORKING_DIR": "d:\\rundir"}}.
2022-08-12 16:48:35,778 INFO runtime_env_agent.py:233 -- Runtime env {"env_vars": {"TUNE_ORIG_WORKING_DIR": "d:\\rundir"}} deleted.
@stephano41 when the process is halted, what is the last status report? For me it halts after about 1hr7min
I don't have one ready to copy and paste into this, but I can confirm I see a similar status report message before it exits, just with different directories and names. I usually don't get any stack trace; however, I do sometimes get the stack trace mentioned above:
2022-08-08 17:44:41,788 WARNING worker.py:1404 -- The log monitor on node DESKTOP-N18AJ6S failed with the following error:
OSError: [WinError 87] The parameter is incorrect
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
log_monitor.run()
File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
self.open_closed_files()
File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
self.close_all_files()
File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set
In the first script, after setting gpus_per_trial=0, it runs all 250 trials without failing in 5hr40min. I wonder if there is a problem with the torchvision GPU code causing a failure.
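For reference, the only change needed to repeat that CPU-only run with the first script is the entry point (a sketch of that edit):
if __name__ == "__main__":
    # Request no GPUs so every trial is scheduled on the CPU only.
    main(num_samples=250, max_num_epochs=2, gpus_per_trial=0)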
Updating pytorch to 1.13 did not solve the problem
conda install -c pytorch pytorch==1.12 torchvision=0.13
REM fix the "paging file too small problem"
del d:\miniconda\envs\issue27646\lib\site-packages\torch\lib\*.dll_bak
python d:\temp\fixNvPe.py --input d:\miniconda\envs\issue27646\lib\site-packages\torch\lib\*.dll
Carefully watching Task Manager at the time of the failure, I see a spike in GPU % utilization. It is pretty steady at 20% but then hits 100%. I don't know if that is the cause, or if that is a side effect of the failure.
Hi, I have Windows and I am using Anaconda and would like to install ray into my current environment, but it keeps giving me this error:
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray
I have created a new environment and tried to install it again, but it didn't work. Could you please help me with this one?
Not connected to this issue, but note how I installed ray above. You must use pip; there is no ray conda package yet.
@mattip sorry, I couldn't find an appropriate issue for this one, but I followed your solution and it worked, thank you so much! =)
Perhaps my use of "must" is a bit extreme: there are conda packages, just none yet for Python 3.10+ and Ray 1.13+.
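For example, mirroring the environment setup from earlier in this thread (the environment name is arbitrary; a common cause of that pip error is a Python version for which no Ray wheel exists, so pinning Python 3.9 is a reasonable choice for these Ray versions):
>conda create -n ray-env python=3.9
>conda activate ray-env
>pip install "ray==1.13" "ray[tune]"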
I'm encountering this issue when using ray-1.13.0 through autogluon-0.5.2 on Windows / Python 3.9.3, everything installed with pip (no conda usage):
WARNING worker.py:1404 -- The log monitor on node DESKTOP-X4SMI53 failed with the following error:
OSError: [WinError 87] The parameter is incorrect
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
log_monitor.run()
File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
self.open_closed_files()
File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
self.close_all_files()
File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set
Traceback (most recent call last):
...
File "X:\gt\.pyenv\lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 421, in after_all_folds_scheduled
finished, unfinished = self.ray.wait(unfinished, num_returns=1)
File "X:\gt\.pyenv\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "X:\gt\.pyenv\lib\site-packages\ray\worker.py", line 2015, in wait
ready_ids, remaining_ids = worker.core_worker.wait(
File "python\ray\_raylet.pyx", line 1403, in ray._raylet.CoreWorker.wait
File "python\ray\_raylet.pyx", line 169, in ray._raylet.check_status
KeyboardInterrupt
^C
(.pyenv) X:\gt>
I did not press Ctrl+C, as seems to be suggested by the above stack trace.
It only seems to happen if there are other Python processes running that were not started by and are not related to ray. Windows is known to reuse PIDs, so could it be some sort of race condition: a ray process exits, an unrelated Python process starts (for example a short-lived VS Code pylint helper process), gets assigned the same PID as the former ray process, and the ray orchestrator somehow gets confused?
If I make sure no other Python processes are running except the autogluon one which uses ray, the issue does not seem to happen (or maybe it happens much less frequently).
There is also a Python bug which sounds like it might be related, about os.kill failing with the same WinError 87 in relation to Ctrl+C: https://bugs.python.org/issue42962
The exception happens here: https://github.com/ray-project/ray/blob/73caaf926e4e508a2b2adcf76d719b2e46bf67f2/python/ray/_private/log_monitor.py#L506-L513
While this may be OK on Linux, I am pretty sure it is not OK on Windows. How about using psutil, which is already vendored by ray, to determine if a pid is alive?
I looked it up now and I see a number of possibilities.
- psutil.pid_exists(pid) - returns False if the process is not alive
- psutil.Process(pid).status() - seems to raise psutil.NoSuchProcess if the process is not alive
- psutil.Process(pid).is_running() - the docs have a warning regarding this method: "Note: this will return True also if the process is a zombie (p.status() == psutil.STATUS_ZOMBIE)."
I did some tests with psutil.pid_exists(pid) and it seems to work as intended. It eventually calls native Windows APIs from C: https://github.com/giampaolo/psutil/blob/446f4bfed1463f9016cf6bdd36f6b30700bec0ad/psutil/arch/windows/process_utils.c#L163-L189
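A minimal sketch of that kind of check (an illustration, not the exact test): spawn a short-lived child, let it exit, then ask psutil about the now-stale PID, which avoids the os.kill probe entirely.
import subprocess
import sys
import time

import psutil

# Start a child that exits immediately and remember its PID.
child = subprocess.Popen([sys.executable, "-c", "pass"])
stale_pid = child.pid
child.wait()
time.sleep(0.5)  # give the OS a moment after the child exits

# Usually False; it may be True if the OS has already reused the PID.
print(psutil.pid_exists(stale_pid))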
For a little more context: see Python issue 14480 and Python issue 14484. It is not OK to call os.kill() on Windows except to terminate a process. Replacing this with psutil sounds like a good idea.
The hard part here will be writing a test. Does anyone know if there is an existing test that hits this code path?
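Not an answer to whether such a test already exists, but a sketch of what one could look like (the helper name is_proc_alive is hypothetical; adapt it to whatever the patch ends up calling it):
import subprocess
import sys

import psutil

def is_proc_alive(pid):
    # Hypothetical helper under test: a liveness check that never signals the process.
    return psutil.pid_exists(pid)

def test_is_proc_alive_for_live_and_dead_process():
    # A child that stays alive long enough to be observed as running.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        assert is_proc_alive(proc.pid)
    finally:
        proc.kill()
        proc.wait()
    # After the child has been killed and reaped, the PID should report dead.
    assert not is_proc_alive(proc.pid)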
An option would be keeping os.kill on Linux/Mac and using psutil just on Windows. This way there is no risk of breaking Linux/Mac.
I changed the os.kill to psutil in #30643
This minimal workaround would fix Windows without changing behavior on Linux/Mac:
# /ray/_private/log_monitor.py
def is_proc_alive(pid):
if sys.platform == "win32":
import psutil
try:
return psutil.Process(pid).is_running()
except psutil.NoSuchProcess:
return False
else:
try:
os.kill(pid, 0)
return True
except OSError:
# If OSError is raised, the process is not alive.
return False
I tried doing that in #30643, but it has unintended consequences. I don't understand how the change is influencing other tests.