
[Ray Tune, Core] Ray Tune self-terminates at ~100 trials on Windows

stephano41 opened this issue 2 years ago

What happened + What you expected to happen

Running a Ray Tune experiment with PyTorch and num_samples set to >100 trials results in Ray self-terminating at around 100 trials due to a SIGINT. On resuming the experiment, it self-terminates again after roughly another 100 trials. This only happens on Windows; the same script runs to completion on Linux.

Expected behaviour: Ray Tune completes all trials without a SIGINT in between.

Useful information: the error message can vary depending on where the run terminates; sometimes it just prints a stack trace at the point of termination.

However, it sometimes exits with:

2022-08-08 17:44:41,788 WARNING worker.py:1404 -- The log monitor on node DESKTOP-N18AJ6S failed with the following error:
OSError: [WinError 87] The parameter is incorrect

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
    log_monitor.run()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
    self.open_closed_files()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
    self.close_all_files()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
    os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set

Versions / Dependencies

Ray Tune 1.13.0, PyTorch 1.11.0, Windows 10/11

Reproduction script

This can be reproduced with the example PyTorch script from the Ray Tune documentation:

from __future__ import annotations

from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from filelock import FileLock
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def load_data(data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # We add FileLock here because multiple workers will want to
    # download data, and this may cause overwrites since
    # DataLoader is not threadsafe.
    with FileLock(os.path.expanduser("~/.data.lock")):
        trainset = torchvision.datasets.CIFAR10(
            root=data_dir, train=True, download=True, transform=transform)

        testset = torchvision.datasets.CIFAR10(
            root=data_dir, train=False, download=True, transform=transform)

    return trainset, testset

class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    
def train_cifar(config, checkpoint_dir=None):
    project_dir = os.getenv('TUNE_ORIG_WORKING_DIR')

    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    # The `checkpoint_dir` parameter gets passed by Ray Tune when a checkpoint
    # should be restored.
    if checkpoint_dir:
        checkpoint = os.path.join(checkpoint_dir, "checkpoint")
        model_state, optimizer_state = torch.load(checkpoint)
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    data_dir = os.path.join(project_dir, "data")
    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    for epoch in range(10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        # Here we save a checkpoint. It is automatically registered with
        # Ray Tune and will potentially be passed as the `checkpoint_dir`
        # parameter in future iterations.
        with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save(
                (net.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
    print("Finished Training")
    
    

def test_best_model(best_trial):
    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    best_trained_model.to(device)

    checkpoint_path = os.path.join(best_trial.checkpoint.value, "checkpoint")

    model_state, optimizer_state = torch.load(checkpoint_path)
    best_trained_model.load_state_dict(model_state)

    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = best_trained_model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()


    print("Best trial test set accuracy: {}".format(correct / total))


def main(num_samples=10, max_num_epochs=10, gpus_per_trial: int|float=2):
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16])
    }
    scheduler = ASHAScheduler(
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    result = tune.run(
        tune.with_parameters(train_cifar),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        metric="loss",
        mode="min",
        num_samples=num_samples,
        scheduler=scheduler,
        verbose=2,
        local_dir='/e/GRF_outputs2',
        name='train_cifar_2022-08-08_14-32-23',
        resume=True
    )

    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    if ray.util.client.ray.is_connected():
        # If using Ray Client, we want to make sure checkpoint access
        # happens on the server. So we wrap `test_best_model` in a Ray task.
        # We have to make sure it gets executed on the same node that
        # ``tune.run`` is called on.
        from ray.util.ml_utils.node import force_on_current_node
        remote_fn = force_on_current_node(ray.remote(test_best_model))
        ray.get(remote_fn.remote(best_trial))
    else:
        test_best_model(best_trial)


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=250, max_num_epochs=2, gpus_per_trial=0.5)

The most stripped-down script I've attempted without losing the machine learning component is:

import logging
import os
import ray
import torch
from ray import tune
from ray.tune import CLIReporter
from torch.utils.data import TensorDataset, DataLoader

from torch import nn


class DummyModel(nn.Module):
    def __init__(self, a, output_size, channel_n=9):
        super().__init__()
        self.output_size=output_size

        self.layers = nn.Sequential(nn.Linear(1,a),
                                    nn.ReLU(inplace=True),
                                    nn.Linear(a,output_size))

        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = x.view(-1, 1)
        out = self.layers(x)
        return self.softmax(out)

os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"

logger = logging.getLogger(__name__)


def get_dataloader(batch_size=50, num_workers=4):
    class1 = [2, 4, 6, 8, 10, 12, 14, 16]
    class2 = [3, 6, 9, 12, 15, 18, 21, 24, 27]
    class3 = [1, 5, 7, 11, 13, 17, 19]

    data_x = torch.tensor(class1 + class2 + class3, dtype=torch.float)
    data_y = torch.tensor(len(class1) * [0] + len(class2) * [1] + len(class3) * [2])
    dataset = TensorDataset(data_x, data_y)

    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)


def train_func(config, epochs=2):
    # cwd is changed to the trial folder
    batch_size = config["batch_size"]
    weight_decay = config['weight_decay']
    a = config["a"]
    lr = config['lr']

    # setup code
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # setup dataloaders
    data_loader = get_dataloader(batch_size, num_workers=4)

    output_size = 3
    channel_n = 9
    # setup model
    model = DummyModel(a=a, output_size=output_size, channel_n=channel_n)

    model = model.to(device)

    criterion = torch.nn.NLLLoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay, amsgrad=True)

    for epoch in range(0, epochs):  # loop over the dataset multiple times
        log = {}
        model.train()
        for i, data in enumerate(data_loader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, targets = data
            inputs, targets = inputs.to(device), targets.to(device)

            optimizer.zero_grad()

            # forward + backward
            outputs = model(inputs)
            loss = criterion(outputs, targets)

            loss.backward()
            optimizer.step()
            log = {"loss": loss.item()}

        # log metrics, log in checkpoint in case actor dies half way
        tune.report(**log)


def main():
    tune.run(
        train_func,
        num_samples=250,
        metric='loss',
        mode="min",
        resources_per_trial={"cpu": 4, "gpu": 0.15},
        local_dir="outputs/tune-dummy",
        verbose=2,
        config={
            "lr": ray.tune.loguniform(0.0001, 0.01),
            "weight_decay": ray.tune.loguniform(0.001, 0.01),
            "batch_size": ray.tune.choice([128, 256]),
            "a": ray.tune.randint(1, 30)
        },
        progress_reporter=CLIReporter(metric_columns=["training_iteration", "loss"])
    )

    print("I finished!")


if __name__ == '__main__':
    main()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

stephano41 avatar Aug 08 '22 09:08 stephano41

@mattip is this something you can take a look at?

richardliaw avatar Aug 10 '22 17:08 richardliaw

What version of CUDA are you using?

mattip avatar Aug 11 '22 12:08 mattip

@mattip CUDA 11.3, but I have also tried running this on CPU only (with 6 concurrent trials) and I still run into the same problem.

stephano41 avatar Aug 11 '22 12:08 stephano41

I could not reproduce this with the resource settings below; the second script ran to completion. I have a Windows 10 machine with 64 GB RAM.

resources_per_trial={"cpu": 6, "gpu": 0, "memory": 16 * 1024 * 1024 * 1024}
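For reference, a minimal sketch (assuming the stripped-down second script below in this thread) of where those resource settings slot into the tune.run call; the "memory" value is in bytes (16 GiB here):

# Sketch only: the second script's tune.run call with the resource settings
# that ran to completion for me ("memory" is specified in bytes).
tune.run(
    train_func,
    num_samples=250,
    metric="loss",
    mode="min",
    resources_per_trial={
        "cpu": 6,
        "gpu": 0,
        "memory": 16 * 1024 * 1024 * 1024,  # 16 GiB
    },
    local_dir="outputs/tune-dummy",
    verbose=2,
    config={
        "lr": ray.tune.loguniform(0.0001, 0.01),
        "weight_decay": ray.tune.loguniform(0.001, 0.01),
        "batch_size": ray.tune.choice([128, 256]),
        "a": ray.tune.randint(1, 30),
    },
    progress_reporter=CLIReporter(metric_columns=["training_iteration", "loss"]),
)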

mattip avatar Aug 11 '22 13:08 mattip

I created the environment via

>conda create -n issue27646 python=3.9
>conda activate issue27646
>conda install --channel pytorch pytorch=1.11 pip pandas
>pip install "ray==1.13" "ray[tune]"

mattip avatar Aug 11 '22 13:08 mattip

I tried to run the first script. After running conda install -c pytorch torchvision=0.12, I needed to set resume="AUTO", and then after a while I got

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "d:\miniconda\envs\issue27646\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.

This was after I updated to the latest CUDA 11.7, following this Stack Overflow answer, which notes that starting with CUDA 11.7 the compilation flags will change so that more memory can be shared between DLLs. I used the script in that answer to modify the DLLs in the directory mentioned above and reran the script. This time it started running.

After about 4000 seconds, the run stopped with the message WARNING tune.py:682 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip. I didn't touch the machine, so something else must have triggered the failure. In raylet.out I see:

[2022-08-11 19:24:20,078 I 14084 8692] (raylet.exe) main.cc:291: Raylet received SIGTERM, shutting down...

[2022-08-11 19:24:20,078 I 14084 8692] (raylet.exe) accessor.cc:430: Unregistering node info, node id = 0157f63c3e0cc3b6d2b4cc1fd4e190963b3007c468668082a0396e62

[2022-08-11 19:24:20,086 I 14084 8692] (raylet.exe) io_service_pool.cc:47: IOServicePool is stopped.

[2022-08-11 19:24:20,138 W 14084 13392] (raylet.exe) agent_manager.cc:104: Agent process with pid 7232 exit, return value 21. ip 127.0.0.1. pid 7232

[2022-08-11 19:24:20,138 E 14084 13392] (raylet.exe) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.

but I don't see any errors in dashboard_agent.log.

The last few hundred lines of console output:
== Status ==
Current time: 2022-08-11 19:22:44 (running for 01:07:27.97)
Memory usage on this node: 15.4/63.9 GiB
Using AsyncHyperBand: num_stopped=94
Bracket: Iter 2.000: -1.480599593940377 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 122/250 (26 PENDING, 2 RUNNING, 94 TERMINATED)


Trial train_cifar_67d45_00093 reported loss=1.65 with parameters={'l1': 4, 'l2': 32, 'lr': 0.0008004035146948047, 'batch_size': 2}. This trial completed.
(pid=) 2022-08-11 19:22:48,652  INFO context.py:67 -- Exec'ing worker with command: "d:\miniconda\envs\issue27646\python.exe" d:\miniconda\envs\issue27646\lib\site-packages\ray\workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63578 --object-store-name=tcp://127.0.0.1:65094 --raylet-name=tcp://127.0.0.1:64190 --redis-address=None --storage=None --temp-dir=C:\Users\matti\AppData\Local\Temp\ray --metrics-agent-port=65337 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65286 --redis-password=5241590000000000 --startup-token=117 --runtime-env-hash=-2001109766
(train_cifar pid=11188) [1,  2000] loss: 2.030
(train_cifar pid=11548) Files already downloaded and verified
2022-08-11 19:22:51,667 WARNING tune.py:682 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
(train_cifar pid=11548) Files already downloaded and verified
(train_cifar pid=11188) [1,  4000] loss: 0.853
== Status ==
Current time: 2022-08-11 19:22:53 (running for 01:07:37.01)
Memory usage on this node: 12.2/63.9 GiB
Using AsyncHyperBand: num_stopped=95
Bracket: Iter 2.000: -1.4806991674900054 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (26 PENDING, 2 RUNNING, 95 TERMINATED)


== Status ==
Current time: 2022-08-11 19:22:53 (running for 01:07:37.02)
Memory usage on this node: 12.2/63.9 GiB
Using AsyncHyperBand: num_stopped=95
Bracket: Iter 2.000: -1.4806991674900054 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (26 PENDING, 2 RUNNING, 95 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |   iter |   total time (s) |    loss |   accuracy |
|-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------|
| train_cifar_67d45_00095 | RUNNING    | 127.0.0.1:11188 |            4 |   16 |  256 | 0.00248724  |        |                  |         |            |
| train_cifar_67d45_00096 | RUNNING    | 127.0.0.1:11548 |            2 |  128 |   16 | 0.00299209  |        |                  |         |            |
| train_cifar_67d45_00097 | PENDING    |                 |            8 |    8 |   32 | 0.0011024   |        |                  |         |            |
| train_cifar_67d45_00098 | PENDING    |                 |           16 |    8 |   32 | 0.000731722 |        |                  |         |            |
| train_cifar_67d45_00099 | PENDING    |                 |            4 |    4 |    8 | 0.00119135  |        |                  |         |            |
| train_cifar_67d45_00100 | PENDING    |                 |           16 |  256 |  256 | 0.0024511   |        |                  |         |            |
| train_cifar_67d45_00101 | PENDING    |                 |           16 |   64 |   16 | 0.000200764 |        |                  |         |            |
| train_cifar_67d45_00102 | PENDING    |                 |            4 |  128 |   64 | 0.0108665   |        |                  |         |            |
| train_cifar_67d45_00103 | PENDING    |                 |            8 |   32 |    4 | 0.00048593  |        |                  |         |            |
| train_cifar_67d45_00104 | PENDING    |                 |            2 |   16 |   32 | 0.00235333  |        |                  |         |            |
| train_cifar_67d45_00105 | PENDING    |                 |            4 |  256 |   16 | 0.00226565  |        |                  |         |            |
| train_cifar_67d45_00106 | PENDING    |                 |            4 |   32 |   16 | 0.0681302   |        |                  |         |            |
| train_cifar_67d45_00107 | PENDING    |                 |            2 |   16 |   16 | 0.0211689   |        |                  |         |            |
| train_cifar_67d45_00108 | PENDING    |                 |            4 |    4 |  256 | 0.000285824 |        |                  |         |            |
| train_cifar_67d45_00109 | PENDING    |                 |            8 |   16 |  256 | 0.0564388   |        |                  |         |            |
| train_cifar_67d45_00110 | PENDING    |                 |            2 |   32 |    4 | 0.00156365  |        |                  |         |            |
| train_cifar_67d45_00111 | PENDING    |                 |           16 |   64 |    4 | 0.0973652   |        |                  |         |            |
| train_cifar_67d45_00112 | PENDING    |                 |           16 |    8 |    8 | 0.00774178  |        |                  |         |            |
| train_cifar_67d45_00113 | PENDING    |                 |            8 |  256 |   64 | 0.00879403  |        |                  |         |            |
| train_cifar_67d45_00114 | PENDING    |                 |            2 |  128 |  256 | 0.000352467 |        |                  |         |            |
| train_cifar_67d45_00115 | PENDING    |                 |           16 |   32 |    4 | 0.000714886 |        |                  |         |            |
| train_cifar_67d45_00116 | PENDING    |                 |            8 |    8 |   64 | 0.00794146  |        |                  |         |            |
| train_cifar_67d45_00117 | PENDING    |                 |            8 |    8 |    4 | 0.000130055 |        |                  |         |            |
| train_cifar_67d45_00118 | PENDING    |                 |            4 |    4 |   64 | 0.000361327 |        |                  |         |            |
| train_cifar_67d45_00119 | PENDING    |                 |            2 |   16 |    4 | 0.0193271   |        |                  |         |            |
| train_cifar_67d45_00120 | PENDING    |                 |            4 |   64 |   16 | 0.00325031  |        |                  |         |            |
| train_cifar_67d45_00121 | PENDING    |                 |           16 |  256 |   64 | 0.0396648   |        |                  |         |            |
| train_cifar_67d45_00122 | PENDING    |                 |           16 |   32 |   16 | 0.000303553 |        |                  |         |            |
| train_cifar_67d45_00000 | TERMINATED | 127.0.0.1:336   |           16 |   16 |    4 | 0.0124023   |      2 |          77.2554 | 1.59222 |     0.4155 |
| train_cifar_67d45_00001 | TERMINATED | 127.0.0.1:11812 |            4 |   16 |   64 | 0.00403809  |      2 |         110.609  | 1.52443 |     0.4596 |
| train_cifar_67d45_00002 | TERMINATED | 127.0.0.1:13492 |            8 |    4 |    4 | 0.0913862   |      1 |          45.6294 | 2.31545 |     0.0994 |
| train_cifar_67d45_00003 | TERMINATED | 127.0.0.1:4608  |           16 |  128 |    8 | 0.0640589   |      1 |          40.6615 | 2.31283 |     0.0942 |
| train_cifar_67d45_00004 | TERMINATED | 127.0.0.1:3704  |            2 |    8 |  256 | 0.000150398 |      2 |         151.123  | 1.4805  |     0.4549 |
| train_cifar_67d45_00005 | TERMINATED | 127.0.0.1:4064  |           16 |    8 |    4 | 0.00111178  |      1 |          41.0484 | 2.17909 |     0.198  |
| train_cifar_67d45_00006 | TERMINATED | 127.0.0.1:3968  |           16 |   32 |  256 | 0.00632639  |      2 |          77.9044 | 1.30005 |     0.534  |
| train_cifar_67d45_00007 | TERMINATED | 127.0.0.1:10868 |            8 |    4 |   16 | 0.0250599   |      1 |          46.0688 | 2.31255 |     0.1043 |
| train_cifar_67d45_00008 | TERMINATED | 127.0.0.1:11900 |            4 |  128 |  256 | 0.00094432  |      2 |         111.369  | 1.2784  |     0.5435 |
| train_cifar_67d45_00009 | TERMINATED | 127.0.0.1:2348  |            2 |   64 |  128 | 0.000111395 |      1 |          78.9413 | 1.94607 |     0.2959 |
| train_cifar_67d45_00010 | TERMINATED | 127.0.0.1:10284 |            2 |  256 |    4 | 0.00148054  |      1 |          79.7107 | 2.30502 |     0.1024 |
| train_cifar_67d45_00011 | TERMINATED | 127.0.0.1:9696  |            8 |   64 |    8 | 0.000299093 |      1 |          46.096  | 2.23853 |     0.1884 |
| train_cifar_67d45_00012 | TERMINATED | 127.0.0.1:12260 |           16 |    8 |    4 | 0.0101674   |      2 |          77.8852 | 1.59929 |     0.4011 |
| train_cifar_67d45_00013 | TERMINATED | 127.0.0.1:8476  |            8 |   32 |   16 | 0.0124174   |      2 |          87.9289 | 1.71918 |     0.3633 |
| train_cifar_67d45_00014 | TERMINATED | 127.0.0.1:1516  |            8 |   16 |   32 | 0.0323596   |      1 |          45.8615 | 2.31084 |     0.0981 |
| train_cifar_67d45_00015 | TERMINATED | 127.0.0.1:9808  |           16 |  128 |    4 | 0.000842029 |      2 |          77.9852 | 1.77241 |     0.3257 |
| train_cifar_67d45_00016 | TERMINATED | 127.0.0.1:13544 |            8 |   64 |    8 | 0.000126637 |      1 |          46.2296 | 2.30141 |     0.1111 |
| train_cifar_67d45_00017 | TERMINATED | 127.0.0.1:9628  |            2 |  128 |  128 | 0.000865456 |      2 |         157.568  | 1.34067 |     0.518  |
| train_cifar_67d45_00018 | TERMINATED | 127.0.0.1:10988 |            8 |    4 |    4 | 0.000133137 |      1 |          46.7045 | 2.31236 |     0.0943 |
| train_cifar_67d45_00019 | TERMINATED | 127.0.0.1:11440 |            2 |   64 |  256 | 0.000839588 |      2 |         157.86   | 1.34019 |     0.5154 |
| train_cifar_67d45_00020 | TERMINATED | 127.0.0.1:7180  |            4 |    8 |   64 | 0.00176578  |      2 |         112.179  | 1.35746 |     0.5088 |
| train_cifar_67d45_00021 | TERMINATED | 127.0.0.1:2072  |            8 |   32 |   64 | 0.0359645   |      1 |          45.7077 | 2.3174  |     0.0997 |
| train_cifar_67d45_00022 | TERMINATED | 127.0.0.1:6648  |           16 |   32 |    4 | 0.0691986   |      1 |          40.5521 | 2.31416 |     0.1017 |
| train_cifar_67d45_00023 | TERMINATED | 127.0.0.1:12884 |           16 |   16 |   16 | 0.0348501   |      1 |          40.1453 | 1.98061 |     0.2451 |
| train_cifar_67d45_00024 | TERMINATED | 127.0.0.1:3928  |           16 |    4 |    8 | 0.00240846  |      1 |          40.6082 | 2.0129  |     0.2385 |
| train_cifar_67d45_00025 | TERMINATED | 127.0.0.1:14028 |            2 |   16 |  256 | 0.0916574   |      1 |          77.8128 | 2.40566 |     0.0987 |
| train_cifar_67d45_00026 | TERMINATED | 127.0.0.1:1308  |            8 |   16 |   64 | 0.000723566 |      2 |          86.6233 | 1.4807  |     0.4534 |
| train_cifar_67d45_00027 | TERMINATED | 127.0.0.1:3740  |           16 |   16 |   64 | 0.000106305 |      1 |          40.1516 | 2.30268 |     0.1006 |
| train_cifar_67d45_00028 | TERMINATED | 127.0.0.1:4216  |           16 |   64 |   32 | 0.000128329 |      1 |          40.4751 | 2.30131 |     0.1052 |
| train_cifar_67d45_00029 | TERMINATED | 127.0.0.1:12724 |            2 |   32 |   32 | 0.0868636   |      1 |          76.6885 | 2.43158 |     0.1006 |
| train_cifar_67d45_00030 | TERMINATED | 127.0.0.1:6016  |            4 |   16 |  128 | 0.000196543 |      2 |         111.905  | 1.57292 |     0.4263 |
| train_cifar_67d45_00031 | TERMINATED | 127.0.0.1:2380  |           16 |    4 |   16 | 0.00353314  |      2 |          76.4946 | 1.47283 |     0.4526 |
| train_cifar_67d45_00032 | TERMINATED | 127.0.0.1:6264  |            4 |   64 |   32 | 0.00470306  |      2 |         110.682  | 1.61346 |     0.4188 |
| train_cifar_67d45_00033 | TERMINATED | 127.0.0.1:2660  |            4 |   32 |   16 | 0.000171846 |      1 |          56.9745 | 2.061   |     0.2028 |
| train_cifar_67d45_00034 | TERMINATED | 127.0.0.1:11748 |            8 |    4 |    4 | 0.0356921   |      1 |          46.5302 | 2.31146 |     0.1007 |
| train_cifar_67d45_00035 | TERMINATED | 127.0.0.1:13868 |            8 |  256 |  256 | 0.00027364  |      1 |          46.8195 | 2.02851 |     0.2424 |
| train_cifar_67d45_00036 | TERMINATED | 127.0.0.1:4852  |            2 |    8 |  128 | 0.0272566   |      1 |          78.8351 | 2.34708 |     0.099  |
| train_cifar_67d45_00037 | TERMINATED | 127.0.0.1:3896  |            4 |    8 |   64 | 0.0293812   |      1 |          58.2364 | 2.31463 |     0.0978 |
| train_cifar_67d45_00038 | TERMINATED | 127.0.0.1:13252 |            2 |    4 |   64 | 0.000994105 |      2 |         149.891  | 1.47705 |     0.4587 |
| train_cifar_67d45_00039 | TERMINATED | 127.0.0.1:11456 |            4 |  128 |   32 | 0.00078335  |      2 |         112.726  | 1.35677 |     0.5105 |
| train_cifar_67d45_00040 | TERMINATED | 127.0.0.1:10568 |           16 |    8 |    4 | 0.00592756  |      2 |          78.1619 | 1.8449  |     0.2313 |
| train_cifar_67d45_00041 | TERMINATED | 127.0.0.1:12360 |            4 |   64 |  128 | 0.000160305 |      1 |          55.9045 | 2.03543 |     0.234  |
| train_cifar_67d45_00042 | TERMINATED | 127.0.0.1:13544 |            4 |  256 |  128 | 0.00492324  |      2 |         108.62   | 1.55762 |     0.4429 |
| train_cifar_67d45_00043 | TERMINATED | 127.0.0.1:11532 |            8 |   32 |    8 | 0.000333664 |      1 |          46.5984 | 2.27785 |     0.1753 |
| train_cifar_67d45_00044 | TERMINATED | 127.0.0.1:2348  |            2 |   32 |  128 | 0.000241014 |      2 |         155      | 1.43656 |     0.4784 |
| train_cifar_67d45_00045 | TERMINATED | 127.0.0.1:4280  |           16 |    8 |    8 | 0.000284191 |      1 |          40.9402 | 2.28762 |     0.1432 |
| train_cifar_67d45_00046 | TERMINATED | 127.0.0.1:2952  |            8 |   64 |  128 | 0.00545612  |      2 |          89.7666 | 1.37725 |     0.5139 |
| train_cifar_67d45_00047 | TERMINATED | 127.0.0.1:13204 |            2 |   16 |    8 | 0.00372381  |      2 |         149.886  | 1.96147 |     0.2749 |
| train_cifar_67d45_00048 | TERMINATED | 127.0.0.1:9816  |           16 |    4 |    4 | 0.0129934   |      2 |          76.3272 | 1.70723 |     0.3373 |
| train_cifar_67d45_00049 | TERMINATED | 127.0.0.1:11536 |           16 |    4 |  128 | 0.000137002 |      1 |          40.6608 | 2.23616 |     0.1501 |
| train_cifar_67d45_00050 | TERMINATED | 127.0.0.1:12232 |            2 |  256 |    4 | 0.000396095 |      2 |         151.73   | 1.53373 |     0.4467 |
| train_cifar_67d45_00051 | TERMINATED | 127.0.0.1:12620 |            4 |   16 |    4 | 0.0019273   |      2 |         113.352  | 1.60781 |     0.3987 |
| train_cifar_67d45_00052 | TERMINATED | 127.0.0.1:12420 |            2 |    8 |   32 | 0.00083404  |      2 |         152.35   | 1.42813 |     0.479  |
| train_cifar_67d45_00053 | TERMINATED | 127.0.0.1:8036  |            4 |   64 |   32 | 0.0167031   |      1 |          58.3754 | 2.15813 |     0.1692 |
| train_cifar_67d45_00054 | TERMINATED | 127.0.0.1:4376  |            2 |  128 |   16 | 0.0185858   |      1 |          79.929  | 2.3207  |     0.1014 |
| train_cifar_67d45_00055 | TERMINATED | 127.0.0.1:1224  |            8 |   64 |  128 | 0.00030277  |      1 |          47.0069 | 2.07092 |     0.2412 |
| train_cifar_67d45_00056 | TERMINATED | 127.0.0.1:14256 |            8 |   16 |    4 | 0.00138638  |      1 |          45.2928 | 2.29623 |     0.1288 |
| train_cifar_67d45_00057 | TERMINATED | 127.0.0.1:13172 |            4 |  256 |  128 | 0.041717    |      1 |          56.7852 | 2.33236 |     0.1046 |
| train_cifar_67d45_00058 | TERMINATED | 127.0.0.1:864   |            2 |   16 |  256 | 0.000721058 |      2 |         153.45   | 1.44545 |     0.4873 |
| train_cifar_67d45_00059 | TERMINATED | 127.0.0.1:13868 |           16 |  128 |   16 | 0.000218716 |      1 |          40.154  | 2.29752 |     0.1432 |
| train_cifar_67d45_00060 | TERMINATED | 127.0.0.1:10920 |            2 |    8 |    4 | 0.0107507   |      1 |          79.4329 | 2.3216  |     0.097  |
| train_cifar_67d45_00061 | TERMINATED | 127.0.0.1:5440  |           16 |   32 |   64 | 0.000724089 |      1 |          40.6402 | 2.06584 |     0.2631 |
| train_cifar_67d45_00062 | TERMINATED | 127.0.0.1:4336  |            2 |    8 |   16 | 0.000276858 |      2 |         153.573  | 1.42802 |     0.473  |
| train_cifar_67d45_00063 | TERMINATED | 127.0.0.1:10780 |            4 |   64 |   64 | 0.000333598 |      2 |         111.776  | 1.46765 |     0.4614 |
| train_cifar_67d45_00064 | TERMINATED | 127.0.0.1:12644 |            8 |    4 |    4 | 0.0181411   |      1 |          46.1321 | 2.31083 |     0.0981 |
| train_cifar_67d45_00065 | TERMINATED | 127.0.0.1:12428 |            4 |  256 |  128 | 0.0330339   |      1 |          57.841  | 2.26378 |     0.1518 |
| train_cifar_67d45_00066 | TERMINATED | 127.0.0.1:3968  |            8 |  128 |   16 | 0.0317081   |      1 |          45.5992 | 2.30786 |     0.1031 |
| train_cifar_67d45_00067 | TERMINATED | 127.0.0.1:8800  |            8 |   64 |   64 | 0.0691665   |      1 |          46.4963 | 2.31964 |     0.0967 |
| train_cifar_67d45_00068 | TERMINATED | 127.0.0.1:12428 |            2 |  128 |  128 | 0.0197163   |      1 |          78.1825 | 2.31514 |     0.1037 |
| train_cifar_67d45_00069 | TERMINATED | 127.0.0.1:13852 |           16 |  128 |   64 | 0.0161628   |      2 |          78.7702 | 1.54117 |     0.4443 |
| train_cifar_67d45_00070 | TERMINATED | 127.0.0.1:3556  |            8 |    8 |    8 | 0.000105775 |      1 |          46.147  | 2.30702 |     0.1009 |
| train_cifar_67d45_00071 | TERMINATED | 127.0.0.1:10060 |            4 |    8 |    4 | 0.000667863 |      2 |         112.191  | 1.62251 |     0.3845 |
| train_cifar_67d45_00072 | TERMINATED | 127.0.0.1:3612  |            4 |   32 |   16 | 0.00153857  |      2 |         109.401  | 1.45657 |     0.4853 |
| train_cifar_67d45_00073 | TERMINATED | 127.0.0.1:7888  |            4 |   64 |    4 | 0.00949178  |      1 |          58.5782 | 2.30817 |     0.1012 |
| train_cifar_67d45_00074 | TERMINATED | 127.0.0.1:8660  |            2 |   32 |  256 | 0.0674951   |      1 |          78.0406 | 2.33438 |     0.0992 |
| train_cifar_67d45_00075 | TERMINATED | 127.0.0.1:13436 |            8 |   16 |  256 | 0.0783057   |      1 |          45.6636 | 2.31096 |     0.1001 |
| train_cifar_67d45_00076 | TERMINATED | 127.0.0.1:7036  |            2 |  128 |   16 | 0.00196368  |      2 |         151.737  | 1.46339 |     0.4774 |
| train_cifar_67d45_00077 | TERMINATED | 127.0.0.1:12680 |            4 |    8 |   16 | 0.00249843  |      2 |         113.938  | 1.45827 |     0.4617 |
| train_cifar_67d45_00078 | TERMINATED | 127.0.0.1:14140 |            2 |   16 |   32 | 0.018761    |      1 |          77.8859 | 2.32846 |     0.0983 |
| train_cifar_67d45_00079 | TERMINATED | 127.0.0.1:2348  |            2 |    8 |  256 | 0.00177873  |      2 |         152.032  | 1.5823  |     0.4289 |
| train_cifar_67d45_00080 | TERMINATED | 127.0.0.1:2908  |            4 |    8 |   32 | 0.000288898 |      2 |         109.562  | 1.62298 |     0.4019 |
| train_cifar_67d45_00081 | TERMINATED | 127.0.0.1:7036  |            2 |    8 |    4 | 0.00354141  |      2 |         152.99   | 1.9352  |     0.2455 |
| train_cifar_67d45_00082 | TERMINATED | 127.0.0.1:10336 |           16 |    8 |  128 | 0.00496885  |      2 |          78.8535 | 1.43197 |     0.4919 |
| train_cifar_67d45_00083 | TERMINATED | 127.0.0.1:6172  |            2 |   32 |   64 | 0.00194701  |      2 |         152.252  | 1.56962 |     0.435  |
| train_cifar_67d45_00084 | TERMINATED | 127.0.0.1:13336 |           16 |   16 |  128 | 0.00256253  |      2 |          77.6169 | 1.3785  |     0.5008 |
| train_cifar_67d45_00085 | TERMINATED | 127.0.0.1:3448  |            4 |   16 |  256 | 0.0318124   |      1 |          55.7027 | 2.32376 |     0.0941 |
| train_cifar_67d45_00086 | TERMINATED | 127.0.0.1:3880  |            8 |    4 |  128 | 0.000121855 |      1 |          47.7121 | 2.26793 |     0.151  |
| train_cifar_67d45_00087 | TERMINATED | 127.0.0.1:13464 |            2 |   16 |   16 | 0.0307163   |      1 |          78.0027 | 2.36501 |     0.0994 |
| train_cifar_67d45_00088 | TERMINATED | 127.0.0.1:1536  |           16 |   64 |   64 | 0.000480643 |      1 |          40.9625 | 2.18846 |     0.201  |
| train_cifar_67d45_00089 | TERMINATED | 127.0.0.1:6076  |            4 |    8 |   64 | 0.0322508   |      1 |          56.9575 | 2.31536 |     0.1034 |
| train_cifar_67d45_00090 | TERMINATED | 127.0.0.1:13940 |            4 |   64 |    8 | 0.0027166   |      2 |         110.205  | 1.49607 |     0.4566 |
| train_cifar_67d45_00091 | TERMINATED | 127.0.0.1:13320 |            8 |  128 |   64 | 0.0377294   |      1 |          45.8307 | 2.31132 |     0.1025 |
| train_cifar_67d45_00092 | TERMINATED | 127.0.0.1:14276 |            4 |    8 |  128 | 0.000328398 |      2 |         112.965  | 1.45642 |     0.4654 |
| train_cifar_67d45_00093 | TERMINATED | 127.0.0.1:1700  |            2 |    4 |   32 | 0.000800404 |      2 |         151.438  | 1.64851 |     0.3835 |
| train_cifar_67d45_00094 | TERMINATED | 127.0.0.1:14324 |            8 |  256 |    8 | 0.000253253 |      1 |          45.7668 | 2.1415  |     0.2159 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+


(train_cifar pid=11188) [1,  6000] loss: 0.536
(train_cifar pid=11188) [1,  8000] loss: 0.392
(train_cifar pid=11188) [1, 10000] loss: 0.305
(train_cifar pid=11548) [1,  2000] loss: 2.160
(train_cifar pid=11548) [1,  4000] loss: 0.971
(train_cifar pid=11548) [1,  6000] loss: 0.621
(train_cifar pid=11548) [1,  8000] loss: 0.449
(train_cifar pid=11548) [1, 10000] loss: 0.357
(train_cifar pid=11548) [1, 12000] loss: 0.299
(train_cifar pid=11548) [1, 14000] loss: 0.248
(train_cifar pid=11548) [1, 16000] loss: 0.217
(train_cifar pid=11548) [1, 18000] loss: 0.190
(train_cifar pid=11548) [1, 20000] loss: 0.173
2022-08-11 19:24:07,997 ERROR tune.py:743 -- Trials did not complete: [train_cifar_67d45_00095, train_cifar_67d45_00096, train_cifar_67d45_00097, train_cifar_67d45_00098, train_cifar_67d45_00099, train_cifar_67d45_00100, train_cifar_67d45_00101, train_cifar_67d45_00102, train_cifar_67d45_00103, train_cifar_67d45_00104, train_cifar_67d45_00105, train_cifar_67d45_00106, train_cifar_67d45_00107, train_cifar_67d45_00108, train_cifar_67d45_00109, train_cifar_67d45_00110, train_cifar_67d45_00111, train_cifar_67d45_00112, train_cifar_67d45_00113, train_cifar_67d45_00114, train_cifar_67d45_00115, train_cifar_67d45_00116, train_cifar_67d45_00117, train_cifar_67d45_00118, train_cifar_67d45_00119, train_cifar_67d45_00120, train_cifar_67d45_00121, train_cifar_67d45_00122]
2022-08-11 19:24:07,997 INFO tune.py:747 -- Total run time: 4131.36 seconds (4057.01 seconds for the tuning loop).
2022-08-11 19:24:07,997 WARNING tune.py:753 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`
(pid=) 2022-08-11 19:24:08,159  INFO context.py:67 -- Exec'ing worker with command: "d:\miniconda\envs\issue27646\python.exe" d:\miniconda\envs\issue27646\lib\site-packages\ray\workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63578 --object-store-name=tcp://127.0.0.1:65094 --raylet-name=tcp://127.0.0.1:64190 --redis-address=None --storage=None --temp-dir=C:\Users\matti\AppData\Local\Temp\ray --metrics-agent-port=65337 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65286 --redis-password=5241590000000000 --startup-token=119 --runtime-env-hash=-2001109766
(pid=) 2022-08-11 19:24:08,335  INFO context.py:67 -- Exec'ing worker with command: "d:\miniconda\envs\issue27646\python.exe" d:\miniconda\envs\issue27646\lib\site-packages\ray\workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63578 --object-store-name=tcp://127.0.0.1:65094 --raylet-name=tcp://127.0.0.1:64190 --redis-address=None --storage=None --temp-dir=C:\Users\matti\AppData\Local\Temp\ray --metrics-agent-port=65337 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65286 --redis-password=5241590000000000 --startup-token=120 --runtime-env-hash=-2001109766
Best trial config: {'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Best trial final validation loss: 1.2783956696420908
Best trial final validation accuracy: 0.5435
Files already downloaded and verified
Files already downloaded and verified
Best trial test set accuracy: 0.5406

mattip avatar Aug 11 '22 16:08 mattip

@richardliaw any thoughts?

mattip avatar Aug 11 '22 16:08 mattip

Using ray==2.0.0rc0 still crashes after ~124 trials. The information is saved, so restarting continues where it left off.

mattip avatar Aug 11 '22 21:08 mattip

Using ray==2.0.0rc0 still crashes after ~124 trials. The information is saved, so restarting continues where it left off.

When I tested it, it crashed again after roughly another 100 trials following the resume.

stephano41 avatar Aug 12 '22 04:08 stephano41

Yes, it exits unexpectedly for me too after it saves the data and prints out a summary.

mattip avatar Aug 12 '22 04:08 mattip

So I see this 2022-08-11 19:22:51,667 WARNING tune.py:682 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip. in the log that you posted @mattip ...

richardliaw avatar Aug 12 '22 05:08 richardliaw

Right. The question is why: what could be sending a stop signal? I scanned the logs; all I could find was [2022-08-11 19:24:20,138 E 14084 13392] (raylet.exe) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See "dashboard_agent.log" for the root cause. but I could not find a hint in dashboard_agent.log. Something else must have sent a signal to the raylet.exe process or killed the agent.

mattip avatar Aug 12 '22 06:08 mattip

@stephano41 when the process is halted, what is the last status report? For me it halts after about 1hr7min

== Status ==
Current time: 2022-08-11 19:22:53 (running for 01:07:37.02)
Memory usage on this node: 12.2/63.9 GiB
Using AsyncHyperBand: num_stopped=95
Bracket: Iter 2.000: -1.4806991674900054 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (26 PENDING, 2 RUNNING, 95 TERMINATED
== Status ==
Current time: 2022-08-12 15:18:57 (running for 01:06:17.48)
Memory usage on this node: 12.9/63.9 GiB
Using AsyncHyperBand: num_stopped=96
Bracket: Iter 2.000: -1.4813302184641362 | Iter 1.000: -1.8759344378352165
Resources requested: 20.0/24 CPUs, 1.0/1 GPUs, 0.0/34.76 GiB heap, 0.0/17.38 GiB objects
Current best trial: adbad_00024 with loss=1.3153812784522771 and parameters={'l1': 256, 'l2': 8, 'lr': 0.0009968174381239548, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (25 PENDING, 2 RUNNING, 96 TERMINATED)


(train_cifar pid=8768) Files already downloaded and verified
2022-08-12 15:18:58,410 WARNING tune.py:686 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.

mattip avatar Aug 12 '22 12:08 mattip

I see an unusual message from the runtime_env module at the end of dashboard_agent.log at the time the process is shutting down. It seems TUNE_ORIG_WORKING_DIR is used both by the script (for the data directory) and by the RayTrialExecutor in tune/execution/ray_trial_executor.py:

2022-08-12 16:48:35,778	INFO runtime_env_agent.py:127 -- Unused runtime env {"env_vars": {"TUNE_ORIG_WORKING_DIR": "d:\\rundir"}}.
2022-08-12 16:48:35,778	INFO runtime_env_agent.py:233 -- Runtime env {"env_vars": {"TUNE_ORIG_WORKING_DIR": "d:\\rundir"}} deleted.

mattip avatar Aug 12 '22 14:08 mattip

@stephano41 when the process is halted, what is the last status report? For me it halts after about 1hr7min

== Status ==
Current time: 2022-08-11 19:22:53 (running for 01:07:37.02)
Memory usage on this node: 12.2/63.9 GiB
Using AsyncHyperBand: num_stopped=95
Bracket: Iter 2.000: -1.4806991674900054 | Iter 1.000: -2.035428786563873
Resources requested: 4.0/24 CPUs, 1.0/1 GPUs, 0.0/35.49 GiB heap, 0.0/17.75 GiB objects
Current best trial: 67d45_00008 with loss=1.2783956696420908 and parameters={'l1': 128, 'l2': 256, 'lr': 0.0009443195440545546, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (26 PENDING, 2 RUNNING, 95 TERMINATED
== Status ==
Current time: 2022-08-12 15:18:57 (running for 01:06:17.48)
Memory usage on this node: 12.9/63.9 GiB
Using AsyncHyperBand: num_stopped=96
Bracket: Iter 2.000: -1.4813302184641362 | Iter 1.000: -1.8759344378352165
Resources requested: 20.0/24 CPUs, 1.0/1 GPUs, 0.0/34.76 GiB heap, 0.0/17.38 GiB objects
Current best trial: adbad_00024 with loss=1.3153812784522771 and parameters={'l1': 256, 'l2': 8, 'lr': 0.0009968174381239548, 'batch_size': 4}
Result logdir: d:\e\GRF_outputs2\train_cifar_2022-08-08_14-32-23
Number of trials: 123/250 (25 PENDING, 2 RUNNING, 96 TERMINATED)


(train_cifar pid=8768) Files already downloaded and verified
2022-08-12 15:18:58,410 WARNING tune.py:686 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.

I don't have one ready to copy and paste into this, but I can confirm I see a similar status report message before it exits, just with different directories and names. I usually don't get any stack trace; however, I do sometimes get the stack trace mentioned above:

2022-08-08 17:44:41,788 WARNING worker.py:1404 -- The log monitor on node DESKTOP-N18AJ6S failed with the following error:
OSError: [WinError 87] The parameter is incorrect

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
    log_monitor.run()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
    self.open_closed_files()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
    self.close_all_files()
  File "C:\Users\Steven\anaconda3\envs\GRF_hip_outcomes\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
    os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set

stephano41 avatar Aug 13 '22 07:08 stephano41

In the first script, after setting gpus_per_trial=0, it ran all 250 trials without failing, in 5 hr 40 min. I wonder if a problem in the torchvision GPU code is causing the failure.
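Concretely, that corresponds to calling the first script's entry point with the per-trial GPU fraction zeroed out:

# Same call as in the first reproduction script, but with no GPU assigned per trial.
main(num_samples=250, max_num_epochs=2, gpus_per_trial=0)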

mattip avatar Aug 13 '22 19:08 mattip

Updating PyTorch and torchvision did not solve the problem:

conda install -c pytorch pytorch==1.12 torchvision=0.13
REM fix the "paging file too small problem"
del  d:\miniconda\envs\issue27646\lib\site-packages\torch\lib\*.dll_bak
python d:\temp\fixNvPe.py --input  d:\miniconda\envs\issue27646\lib\site-packages\torch\lib\*.dll

Carefully watching Task Manager at the time of the failure, I see a spike in GPU utilization: it is pretty steady at 20% but then hits 100%. I don't know if that is the cause or a side effect of the failure.

mattip avatar Aug 15 '22 16:08 mattip

Hi, I have Windows and I am using Anaconda, and I would like to install Ray into my current environment, but it keeps giving me this error:

ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

I have created a new environment and tried to install it again, but it didn't work. Could you please help me with this one?

ElifCerenGokYildirim avatar Aug 23 '22 10:08 ElifCerenGokYildirim

Not connected to this issue, but note how I installed Ray above: you must use pip, there is no Ray conda package yet.

mattip avatar Aug 23 '22 10:08 mattip

@mattip sorry, I couldn't find an appropriate issue for this one, but I followed your solution and it worked. Thank you so much! =)

ElifCerenGokYildirim avatar Aug 23 '22 10:08 ElifCerenGokYildirim

Perhaps my use of "must" is a bit extreme: there are conda packages, just none yet for Python 3.10+ and Ray 1.13+.

mattip avatar Aug 23 '22 10:08 mattip

I'm encountering this issue when using ray-1.13.0 through autogluon-0.5.2 on Windows / Python 3.9.3, everything installed with pip (no conda usage):

WARNING worker.py:1404 -- The log monitor on node DESKTOP-X4SMI53 failed with the following error:
OSError: [WinError 87] The parameter is incorrect

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 451, in <module>
    log_monitor.run()
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 376, in run
    self.open_closed_files()
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 218, in open_closed_files
    self.close_all_files()
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\log_monitor.py", line 130, in close_all_files
    os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set

Traceback (most recent call last):
  ...
  File "X:\gt\.pyenv\lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 421, in after_all_folds_scheduled
    finished, unfinished = self.ray.wait(unfinished, num_returns=1)
  File "X:\gt\.pyenv\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "X:\gt\.pyenv\lib\site-packages\ray\worker.py", line 2015, in wait
    ready_ids, remaining_ids = worker.core_worker.wait(
  File "python\ray\_raylet.pyx", line 1403, in ray._raylet.CoreWorker.wait
  File "python\ray\_raylet.pyx", line 169, in ray._raylet.check_status
KeyboardInterrupt
^C
(.pyenv) X:\gt>

I did not press Ctrl+C, as the above stack trace seems to suggest.

It only seems to happen if there are other Python processes running that were not started by and are not related to Ray. Windows is known to reuse PIDs, so could it be some sort of race condition: a Ray process exits, an unrelated Python process starts (for example a short-lived VS Code pylint helper process) and gets assigned the same PID as the former Ray process, and the Ray orchestrator somehow gets confused?

If I make sure no other Python processes are running except the autogluon one that uses Ray, the issue does not seem to happen (or maybe it happens much less frequently).

There is also a Python bug that sounds like it might be related, about os.kill failing with the same WinError 87 in relation to Ctrl+C: https://bugs.python.org/issue42962
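For illustration only (a minimal sketch, not Ray code): psutil can guard against exactly this kind of PID reuse, because a psutil.Process object snapshots the process creation time when it is constructed and is_running() compares against it later:

# Illustrative sketch, independent of Ray: a PID-reuse-aware liveness check.
import subprocess
import sys

import psutil

child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
tracked = psutil.Process(child.pid)   # snapshots PID + creation time

print(tracked.is_running())           # True: the original process is still there

child.kill()
child.wait()

print(tracked.is_running())           # False: the child has exited

# If the OS later reuses this PID for an unrelated process, is_running() on the
# same `tracked` object still returns False, because the creation time differs.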

2-5 avatar Nov 03 '22 14:11 2-5

The exception happens here https://github.com/ray-project/ray/blob/73caaf926e4e508a2b2adcf76d719b2e46bf67f2/python/ray/_private/log_monitor.py#L506-L513

While this may be OK on Linux, I am pretty sure it is not OK on Windows.
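Paraphrasing the linked lines (this is the common POSIX liveness idiom, not the exact Ray source; worker_is_alive is just an illustrative name):

# Paraphrase of the probe at the linked lines, not the exact Ray source.
# On POSIX, signal 0 performs existence/permission checking only and sends
# nothing. On Windows, 0 equals signal.CTRL_C_EVENT, so os.kill(pid, 0) maps to
# GenerateConsoleCtrlEvent rather than a harmless probe: it can end up
# delivering a console Ctrl+C and can also fail with WinError 87.
import os

def worker_is_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False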

mattip avatar Nov 15 '22 16:11 mattip

How about using psutil, which is already vendored by Ray, to determine whether a PID is alive?

I looked it up now and I see a number of possibilities:

  1. psutil.pid_exists(pid) - returns False if the process is not alive.

  2. psutil.Process(pid).status() - seems to raise psutil.NoSuchProcess if the process is not alive.

  3. psutil.Process(pid).is_running() - the docs have a warning regarding this method: it will also return True if the process is a zombie (p.status() == psutil.STATUS_ZOMBIE).

I did some tests with psutil.pid_exists(pid) and it seems to work as intended. It eventually calls native Windows APIs from C:

https://github.com/giampaolo/psutil/blob/446f4bfed1463f9016cf6bdd36f6b30700bec0ad/psutil/arch/windows/process_utils.c#L163-L189

2-5 avatar Nov 15 '22 20:11 2-5

For a little more context, see Python issue 14480 and Python issue 14484. It is not OK to call os.kill() on Windows except to terminate a process. Replacing this with psutil sounds like a good idea.

mattip avatar Nov 21 '22 13:11 mattip

The hard part here will be writing a test. Does anyone know if there is an existing test that hits this code path?

mattip avatar Nov 21 '22 14:11 mattip

An option would be keeping os.kill on Linux/Mac and using psutil just on Windows. This way there is no risk of breaking Linux/Mac.

2-5 avatar Nov 21 '22 18:11 2-5

I changed the os.kill to psutil in #30643

mattip avatar Nov 24 '22 09:11 mattip

This minimal workaround would fix Windows without changing behavior on Linux/Mac:

# /ray/_private/log_monitor.py

import os
import sys  # module-level imports shown so the snippet stands alone


def is_proc_alive(pid):
    if sys.platform == "win32":
        import psutil  # already shipped with ray, so available at runtime
        try:
            return psutil.Process(pid).is_running()
        except psutil.NoSuchProcess:
            return False
    else:
        try:
            # Signal 0 is a pure existence/permission check on POSIX.
            os.kill(pid, 0)
            return True
        except OSError:
            # If OSError is raised, the process is not alive.
            return False
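A quick standalone sanity check of the helper above (illustrative only, not part of the proposed change):

import os

# The current interpreter's own PID should be reported as alive on any platform.
print(is_proc_alive(os.getpid()))  # expected: True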

2-5 avatar Jan 13 '23 12:01 2-5

I tried doing that in #30643, but it has unintended consequences. I don't understand how the change is influencing other tests.

mattip avatar Jan 16 '23 08:01 mattip