
[Tune] Cannot run two Tune scripts at the same time - if I run two Ray Tune scripts, the first script will soon be using the code of the second script and abandon its own code (with detailed and easy reproduction; with bug-cause analysis)


What happened + What you expected to happen

Consider this scenario: I have a.py containing a Ray Tune script, and b.py containing another. If I run python a.py and then python b.py (i.e. run two experiments), the python a.py process will soon be running code from b.py instead of its own code from a.py. This makes the otherwise great Ray Tune completely useless :(

More details can be seen in the reproduction script.

The cause of the bug is that the trainable object is fetched from the GCS store using the same key for both experiments, even though the keys should be different. I can provide more details if needed.
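To make the failure mode concrete, here is a minimal, hypothetical sketch of a name-keyed global store (a stand-in for the actual GCS-backed registry; all names below are invented for illustration):

# Hypothetical sketch, NOT Tune's actual implementation: a global
# store keyed only by the trainable's name, shared by all experiments.
_registry = {}

def register(name, payload):
    _registry[name] = payload  # same key => silent overwrite

def fetch(name):
    return _registry[name]

# a.py and b.py both register under the key "MyTrainable":
register("MyTrainable", "trainable code from a.py")
register("MyTrainable", "trainable code from b.py")  # b.py starts later

# A new trial launched by a.py's driver now fetches b.py's code:
print(fetch("MyTrainable"))  # -> "trainable code from b.py"

With a per-experiment key (or a unique trainable name, as suggested further down in this thread), the two registrations would not collide.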

Versions / Dependencies

latest

Reproduction script

Create two files:

a.py

import time
from typing import Dict, Any, Optional, Union

from ray import tune
from ray.tune import result


class MyTrainable(tune.Trainable):
    def setup(self, config: Dict) -> None:
        print(f'MyTrainable.setup config={config}')

    def step(self) -> Dict[str, Any]:
        # The "A" marker shows which script's code this worker is running.
        print('MyTrainable.step start, I am AAAAAAAAAAAAAA!')
        time.sleep(3)
        # Report a dummy metric and mark the trial as done after one step.
        return {'dummy_accuracy': 111, result.DONE: True}

    def save_checkpoint(self, checkpoint_dir: str) -> Optional[Union[str, Dict]]:
        pass  # checkpointing is not relevant to this reproduction

    def load_checkpoint(self, checkpoint: Union[Dict, str]):
        pass


tuner = tune.Tuner(
    MyTrainable,
    tune_config=tune.TuneConfig(
        num_samples=1000,
        max_concurrent_trials=1,
    ),
)

print('call tuner.fit')
tuner.fit()

b.py (same as a.py, except that the print messages say "B" and the reported dummy_accuracy is 222)

import time
from typing import Dict, Any, Optional, Union

from ray import tune
from ray.tune import result


class MyTrainable(tune.Trainable):
    def setup(self, config: Dict) -> None:
        print(f'MyTrainable.setup config={config}')

    def step(self) -> Dict[str, Any]:
        print('MyTrainable.step start, I am BBBBBBBBBBBBB!')
        time.sleep(3)
        return {'dummy_accuracy': 222, result.DONE: True}

    def save_checkpoint(self, checkpoint_dir: str) -> Optional[Union[str, Dict]]:
        pass

    def load_checkpoint(self, checkpoint: Union[Dict, str]):
        pass


tuner = tune.Tuner(
    MyTrainable,
    tune_config=tune.TuneConfig(
        num_samples=1000,
        max_concurrent_trials=1,
    ),
)

print('call tuner.fit')
tuner.fit()

Now open terminal 1 and run python a.py. Wait for a while (e.g. let it execute a few trials), then open terminal 2 and run python b.py. Then watch both terminals for some time.

You can then see that the terminal running python a.py outputs "I am B", even though it is running a.py and should only ever say "I am A"!

Full log:

terminal 1 (a.py)

$ python a.py
call tuner.fit
2022-12-27 18:38:29,566 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 10.20.28.72:6379...
2022-12-27 18:38:29,579 INFO worker.py:1519 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
(raylet) [2022-12-27 18:38:30,566 E 638 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-26_22-43-48_950666_32692 is over 95% full, available space: 9133031424; capacity: 502930677760. Object creation will fail if spilling is required.
2022-12-27 18:38:30,806 WARNING tune.py:689 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, wrap `train_func` with `tune.with_resources(train_func, resources_per_trial={'gpu': 1})` which allows Tune to expose 1 GPU to each trial. For Ray AIR Trainers, you can specify GPU resources through `ScalingConfig(use_gpu=True)`. You can also override `Trainable.default_resource_request` if using the Trainable API.
== Status ==
Current time: 2022-12-27 18:38:34 (running for 00:00:03.28)
Memory usage on this node: 28.9/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 1/1000 (1 RUNNING)
+-------------------------+----------+-------------------+
| Trial name              | status   | loc               |
|-------------------------+----------+-------------------|
| MyTrainable_9a447_00000 | RUNNING  | 10.20.28.72:11011 |
+-------------------------+----------+-------------------+


(MyTrainable pid=11011) MyTrainable.setup config={}
(MyTrainable pid=11011) MyTrainable.step start, I am AAAAAAAAAAAAAA!
Result for MyTrainable_9a447_00000:
  date: 2022-12-27_18-38-37
  done: true
  dummy_accuracy: 111
  experiment_id: e11d61d58c034c40bfae3f72ea53b9c7
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11011
  time_since_restore: 3.0146656036376953
  time_this_iter_s: 3.0146656036376953
  time_total_s: 3.0146656036376953
  timestamp: 1672137517
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a447_00000
  warmup_time: 0.004631757736206055
  
== Status ==
Current time: 2022-12-27 18:38:37 (running for 00:00:06.30)
Memory usage on this node: 28.9/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 1/1000 (1 RUNNING)
+-------------------------+----------+-------------------+--------+------------------+------------------+
| Trial name              | status   | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+----------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00000 | RUNNING  | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
+-------------------------+----------+-------------------+--------+------------------+------------------+


(MyTrainable pid=11115) MyTrainable.setup config={}
(MyTrainable pid=11115) MyTrainable.step start, I am AAAAAAAAAAAAAA!
(raylet) [2022-12-27 18:38:40,576 E 638 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-26_22-43-48_950666_32692 is over 95% full, available space: 9132912640; capacity: 502930677760. Object creation will fail if spilling is required.
== Status ==
Current time: 2022-12-27 18:38:42 (running for 00:00:11.34)
Memory usage on this node: 28.9/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 2/1000 (1 RUNNING, 1 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00001 | RUNNING    | 10.20.28.72:11115 |        |                  |                  |
| MyTrainable_9a447_00000 | TERMINATED | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_9a447_00001:
  date: 2022-12-27_18-38-42
  done: true
  dummy_accuracy: 111
  experiment_id: dd7915ec42c74a6f847997ee58f8a97f
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11115
  time_since_restore: 3.0108096599578857
  time_this_iter_s: 3.0108096599578857
  time_total_s: 3.0108096599578857
  timestamp: 1672137522
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a447_00001
  warmup_time: 0.004525423049926758
  
(MyTrainable pid=11238) MyTrainable.setup config={}
(MyTrainable pid=11238) MyTrainable.step start, I am AAAAAAAAAAAAAA!
== Status ==
Current time: 2022-12-27 18:38:47 (running for 00:00:16.99)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 3/1000 (1 RUNNING, 2 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00002 | RUNNING    | 10.20.28.72:11238 |        |                  |                  |
| MyTrainable_9a447_00000 | TERMINATED | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
| MyTrainable_9a447_00001 | TERMINATED | 10.20.28.72:11115 |      1 |          3.01081 |              111 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_9a447_00002:
  date: 2022-12-27_18-38-48
  done: true
  dummy_accuracy: 111
  experiment_id: 97bcf866b61d49a88b298ed30cbee6b2
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11238
  time_since_restore: 3.0118117332458496
  time_this_iter_s: 3.0118117332458496
  time_total_s: 3.0118117332458496
  timestamp: 1672137528
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a447_00002
  warmup_time: 0.004528999328613281
  
(MyTrainable pid=11481) MyTrainable.setup config={}
(MyTrainable pid=11481) MyTrainable.step start, I am BBBBBBBBBBBBB!
(raylet) [2022-12-27 18:38:50,585 E 638 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-26_22-43-48_950666_32692 is over 95% full, available space: 9132666880; capacity: 502930677760. Object creation will fail if spilling is required.
== Status ==
Current time: 2022-12-27 18:38:53 (running for 00:00:22.37)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 4/1000 (1 RUNNING, 3 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00003 | RUNNING    | 10.20.28.72:11481 |        |                  |                  |
| MyTrainable_9a447_00000 | TERMINATED | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
| MyTrainable_9a447_00001 | TERMINATED | 10.20.28.72:11115 |      1 |          3.01081 |              111 |
| MyTrainable_9a447_00002 | TERMINATED | 10.20.28.72:11238 |      1 |          3.01181 |              111 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_9a447_00003:
  date: 2022-12-27_18-38-53
  done: true
  dummy_accuracy: 222
  experiment_id: 5460e197d11e439db83ddd7bc31ddda9
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11481
  time_since_restore: 3.01511287689209
  time_this_iter_s: 3.01511287689209
  time_total_s: 3.01511287689209
  timestamp: 1672137533
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a447_00003
  warmup_time: 0.0041506290435791016
  
(MyTrainable pid=11656) MyTrainable.setup config={}
(MyTrainable pid=11656) MyTrainable.step start, I am BBBBBBBBBBBBB!
== Status ==
Current time: 2022-12-27 18:38:58 (running for 00:00:27.53)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 5/1000 (1 RUNNING, 4 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00004 | RUNNING    | 10.20.28.72:11656 |        |                  |                  |
| MyTrainable_9a447_00000 | TERMINATED | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
| MyTrainable_9a447_00001 | TERMINATED | 10.20.28.72:11115 |      1 |          3.01081 |              111 |
| MyTrainable_9a447_00002 | TERMINATED | 10.20.28.72:11238 |      1 |          3.01181 |              111 |
| MyTrainable_9a447_00003 | TERMINATED | 10.20.28.72:11481 |      1 |          3.01511 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_9a447_00004:
  date: 2022-12-27_18-38-58
  done: true
  dummy_accuracy: 222
  experiment_id: 08e9b5d945484f4fa02d3468ee439ce6
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11656
  time_since_restore: 3.0123112201690674
  time_this_iter_s: 3.0123112201690674
  time_total_s: 3.0123112201690674
  timestamp: 1672137538
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a447_00004
  warmup_time: 0.004000186920166016
  
(raylet) [2022-12-27 18:39:00,594 E 638 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-26_22-43-48_950666_32692 is over 95% full, available space: 9132412928; capacity: 502930677760. Object creation will fail if spilling is required.
(MyTrainable pid=11801) MyTrainable.setup config={}
(MyTrainable pid=11801) MyTrainable.step start, I am BBBBBBBBBBBBB!
== Status ==
Current time: 2022-12-27 18:39:03 (running for 00:00:32.88)
Memory usage on this node: 29.3/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 6/1000 (1 RUNNING, 5 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00005 | RUNNING    | 10.20.28.72:11801 |        |                  |                  |
| MyTrainable_9a447_00000 | TERMINATED | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
| MyTrainable_9a447_00001 | TERMINATED | 10.20.28.72:11115 |      1 |          3.01081 |              111 |
| MyTrainable_9a447_00002 | TERMINATED | 10.20.28.72:11238 |      1 |          3.01181 |              111 |
| MyTrainable_9a447_00003 | TERMINATED | 10.20.28.72:11481 |      1 |          3.01511 |              222 |
| MyTrainable_9a447_00004 | TERMINATED | 10.20.28.72:11656 |      1 |          3.01231 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_9a447_00005:
  date: 2022-12-27_18-39-03
  done: true
  dummy_accuracy: 222
  experiment_id: 1a8670eab13741a3abf6c64ada577bec
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11801
  time_since_restore: 3.0144355297088623
  time_this_iter_s: 3.0144355297088623
  time_total_s: 3.0144355297088623
  timestamp: 1672137543
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a447_00005
  warmup_time: 0.004655361175537109
  
(MyTrainable pid=12000) MyTrainable.setup config={}
(MyTrainable pid=12000) MyTrainable.step start, I am BBBBBBBBBBBBB!
== Status ==
Current time: 2022-12-27 18:39:08 (running for 00:00:38.05)
Memory usage on this node: 29.3/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 7/1000 (1 RUNNING, 6 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00006 | RUNNING    | 10.20.28.72:12000 |        |                  |                  |
| MyTrainable_9a447_00000 | TERMINATED | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
| MyTrainable_9a447_00001 | TERMINATED | 10.20.28.72:11115 |      1 |          3.01081 |              111 |
| MyTrainable_9a447_00002 | TERMINATED | 10.20.28.72:11238 |      1 |          3.01181 |              111 |
| MyTrainable_9a447_00003 | TERMINATED | 10.20.28.72:11481 |      1 |          3.01511 |              222 |
| MyTrainable_9a447_00004 | TERMINATED | 10.20.28.72:11656 |      1 |          3.01231 |              222 |
| MyTrainable_9a447_00005 | TERMINATED | 10.20.28.72:11801 |      1 |          3.01444 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_9a447_00006:
  date: 2022-12-27_18-39-09
  done: true
  dummy_accuracy: 222
  experiment_id: 118ad0ed34314e04a1e44cd07bb2ecac
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 12000
  time_since_restore: 3.015101909637451
  time_this_iter_s: 3.015101909637451
  time_total_s: 3.015101909637451
  timestamp: 1672137549
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a447_00006
  warmup_time: 0.00490117073059082
  
(raylet) [2022-12-27 18:39:10,603 E 638 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-26_22-43-48_950666_32692 is over 95% full, available space: 9132158976; capacity: 502930677760. Object creation will fail if spilling is required.
(MyTrainable pid=12144) MyTrainable.setup config={}
(MyTrainable pid=12144) MyTrainable.step start, I am BBBBBBBBBBBBB!
^C2022-12-27 18:39:14,048       WARNING tune.py:705 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip. 
== Status ==
Current time: 2022-12-27 18:39:14 (running for 00:00:43.25)
Memory usage on this node: 29.3/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 8/1000 (1 RUNNING, 7 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00007 | RUNNING    | 10.20.28.72:12144 |        |                  |                  |
| MyTrainable_9a447_00000 | TERMINATED | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
| MyTrainable_9a447_00001 | TERMINATED | 10.20.28.72:11115 |      1 |          3.01081 |              111 |
| MyTrainable_9a447_00002 | TERMINATED | 10.20.28.72:11238 |      1 |          3.01181 |              111 |
| MyTrainable_9a447_00003 | TERMINATED | 10.20.28.72:11481 |      1 |          3.01511 |              222 |
| MyTrainable_9a447_00004 | TERMINATED | 10.20.28.72:11656 |      1 |          3.01231 |              222 |
| MyTrainable_9a447_00005 | TERMINATED | 10.20.28.72:11801 |      1 |          3.01444 |              222 |
| MyTrainable_9a447_00006 | TERMINATED | 10.20.28.72:12000 |      1 |          3.0151  |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


== Status ==
Current time: 2022-12-27 18:39:14 (running for 00:00:43.26)
Memory usage on this node: 29.3/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-29
Number of trials: 8/1000 (1 RUNNING, 7 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_9a447_00007 | RUNNING    | 10.20.28.72:12144 |        |                  |                  |
| MyTrainable_9a447_00000 | TERMINATED | 10.20.28.72:11011 |      1 |          3.01467 |              111 |
| MyTrainable_9a447_00001 | TERMINATED | 10.20.28.72:11115 |      1 |          3.01081 |              111 |
| MyTrainable_9a447_00002 | TERMINATED | 10.20.28.72:11238 |      1 |          3.01181 |              111 |
| MyTrainable_9a447_00003 | TERMINATED | 10.20.28.72:11481 |      1 |          3.01511 |              222 |
| MyTrainable_9a447_00004 | TERMINATED | 10.20.28.72:11656 |      1 |          3.01231 |              222 |
| MyTrainable_9a447_00005 | TERMINATED | 10.20.28.72:11801 |      1 |          3.01444 |              222 |
| MyTrainable_9a447_00006 | TERMINATED | 10.20.28.72:12000 |      1 |          3.0151  |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


2022-12-27 18:39:14,398 ERROR tune.py:773 -- Trials did not complete: [MyTrainable_9a447_00007]
2022-12-27 18:39:14,398 INFO tune.py:777 -- Total run time: 44.79 seconds (43.25 seconds for the tuning loop).
2022-12-27 18:39:14,398 WARNING tune.py:783 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`

terminal 2 (b.py)

$ python b.py
call tuner.fit
2022-12-27 18:38:42,786 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 10.20.28.72:6379...
2022-12-27 18:38:42,798 INFO worker.py:1519 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
2022-12-27 18:38:44,212 WARNING tune.py:689 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, wrap `train_func` with `tune.with_resources(train_func, resources_per_trial={'gpu': 1})` which allows Tune to expose 1 GPU to each trial. For Ray AIR Trainers, you can specify GPU resources through `ScalingConfig(use_gpu=True)`. You can also override `Trainable.default_resource_request` if using the Trainable API.
== Status ==
Current time: 2022-12-27 18:38:47 (running for 00:00:03.21)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-42
Number of trials: 1/1000 (1 RUNNING)
+-------------------------+----------+-------------------+
| Trial name              | status   | loc               |
|-------------------------+----------+-------------------|
| MyTrainable_a224e_00000 | RUNNING  | 10.20.28.72:11404 |
+-------------------------+----------+-------------------+


(MyTrainable pid=11404) MyTrainable.setup config={}
(MyTrainable pid=11404) MyTrainable.step start, I am BBBBBBBBBBBBB!
Result for MyTrainable_a224e_00000:
  date: 2022-12-27_18-38-50
  done: true
  dummy_accuracy: 222
  experiment_id: 10b1108ecbf747c09d5d141a4e20e8b6
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11404
  time_since_restore: 3.0151374340057373
  time_this_iter_s: 3.0151374340057373
  time_total_s: 3.0151374340057373
  timestamp: 1672137530
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: a224e_00000
  warmup_time: 0.0045013427734375
  
== Status ==
Current time: 2022-12-27 18:38:50 (running for 00:00:06.23)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-42
Number of trials: 1/1000 (1 RUNNING)
+-------------------------+----------+-------------------+--------+------------------+------------------+
| Trial name              | status   | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+----------+-------------------+--------+------------------+------------------|
| MyTrainable_a224e_00000 | RUNNING  | 10.20.28.72:11404 |      1 |          3.01514 |              222 |
+-------------------------+----------+-------------------+--------+------------------+------------------+


(raylet) [2022-12-27 18:38:50,585 E 638 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-26_22-43-48_950666_32692 is over 95% full, available space: 9132666880; capacity: 502930677760. Object creation will fail if spilling is required.
(MyTrainable pid=11581) MyTrainable.setup config={}
(MyTrainable pid=11581) MyTrainable.step start, I am BBBBBBBBBBBBB!
== Status ==
Current time: 2022-12-27 18:38:55 (running for 00:00:11.26)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-42
Number of trials: 2/1000 (1 RUNNING, 1 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_a224e_00001 | RUNNING    | 10.20.28.72:11581 |        |                  |                  |
| MyTrainable_a224e_00000 | TERMINATED | 10.20.28.72:11404 |      1 |          3.01514 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_a224e_00001:
  date: 2022-12-27_18-38-55
  done: true
  dummy_accuracy: 222
  experiment_id: 40bee4f9b1254df9a45338e8fccf371a
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11581
  time_since_restore: 3.012342929840088
  time_this_iter_s: 3.012342929840088
  time_total_s: 3.012342929840088
  timestamp: 1672137535
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: a224e_00001
  warmup_time: 0.003916025161743164
  
(MyTrainable pid=11729) MyTrainable.setup config={}
(MyTrainable pid=11729) MyTrainable.step start, I am BBBBBBBBBBBBB!
== Status ==
Current time: 2022-12-27 18:39:00 (running for 00:00:16.42)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-42
Number of trials: 3/1000 (1 RUNNING, 2 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_a224e_00002 | RUNNING    | 10.20.28.72:11729 |        |                  |                  |
| MyTrainable_a224e_00000 | TERMINATED | 10.20.28.72:11404 |      1 |          3.01514 |              222 |
| MyTrainable_a224e_00001 | TERMINATED | 10.20.28.72:11581 |      1 |          3.01234 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


(raylet) [2022-12-27 18:39:00,594 E 638 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-26_22-43-48_950666_32692 is over 95% full, available space: 9132412928; capacity: 502930677760. Object creation will fail if spilling is required.
Result for MyTrainable_a224e_00002:
  date: 2022-12-27_18-39-00
  done: true
  dummy_accuracy: 222
  experiment_id: 38db704ef7ff4896b3f95564b5756136
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11729
  time_since_restore: 3.0130319595336914
  time_this_iter_s: 3.0130319595336914
  time_total_s: 3.0130319595336914
  timestamp: 1672137540
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: a224e_00002
  warmup_time: 0.004384279251098633
  
(MyTrainable pid=11909) MyTrainable.setup config={}
(MyTrainable pid=11909) MyTrainable.step start, I am BBBBBBBBBBBBB!
== Status ==
Current time: 2022-12-27 18:39:05 (running for 00:00:21.74)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-42
Number of trials: 4/1000 (1 RUNNING, 3 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_a224e_00003 | RUNNING    | 10.20.28.72:11909 |        |                  |                  |
| MyTrainable_a224e_00000 | TERMINATED | 10.20.28.72:11404 |      1 |          3.01514 |              222 |
| MyTrainable_a224e_00001 | TERMINATED | 10.20.28.72:11581 |      1 |          3.01234 |              222 |
| MyTrainable_a224e_00002 | TERMINATED | 10.20.28.72:11729 |      1 |          3.01303 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_a224e_00003:
  date: 2022-12-27_18-39-06
  done: true
  dummy_accuracy: 222
  experiment_id: 054e6d504e284671abffe09d7b9c962f
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 11909
  time_since_restore: 3.014749050140381
  time_this_iter_s: 3.014749050140381
  time_total_s: 3.014749050140381
  timestamp: 1672137546
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: a224e_00003
  warmup_time: 0.004586219787597656
  
(MyTrainable pid=12073) MyTrainable.setup config={}
(MyTrainable pid=12073) MyTrainable.step start, I am BBBBBBBBBBBBB!
(raylet) [2022-12-27 18:39:10,603 E 638 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-26_22-43-48_950666_32692 is over 95% full, available space: 9132158976; capacity: 502930677760. Object creation will fail if spilling is required.
== Status ==
Current time: 2022-12-27 18:39:11 (running for 00:00:26.89)
Memory usage on this node: 29.3/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-42
Number of trials: 5/1000 (1 RUNNING, 4 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_a224e_00004 | RUNNING    | 10.20.28.72:12073 |        |                  |                  |
| MyTrainable_a224e_00000 | TERMINATED | 10.20.28.72:11404 |      1 |          3.01514 |              222 |
| MyTrainable_a224e_00001 | TERMINATED | 10.20.28.72:11581 |      1 |          3.01234 |              222 |
| MyTrainable_a224e_00002 | TERMINATED | 10.20.28.72:11729 |      1 |          3.01303 |              222 |
| MyTrainable_a224e_00003 | TERMINATED | 10.20.28.72:11909 |      1 |          3.01475 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


Result for MyTrainable_a224e_00004:
  date: 2022-12-27_18-39-11
  done: true
  dummy_accuracy: 222
  experiment_id: 3725ff5b47cb4657b8b27b6e76cddb76
  hostname: rail-PR4764GW
  iterations_since_restore: 1
  node_ip: 10.20.28.72
  pid: 12073
  time_since_restore: 3.0146608352661133
  time_this_iter_s: 3.0146608352661133
  time_total_s: 3.0146608352661133
  timestamp: 1672137551
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: a224e_00004
  warmup_time: 0.004378080368041992
  
(MyTrainable pid=12252) MyTrainable.setup config={}
(MyTrainable pid=12252) MyTrainable.step start, I am BBBBBBBBBBBBB!
^C2022-12-27 18:39:15,311       WARNING tune.py:705 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip. 
== Status ==
Current time: 2022-12-27 18:39:16 (running for 00:00:32.11)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-42
Number of trials: 6/1000 (1 RUNNING, 5 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_a224e_00005 | RUNNING    | 10.20.28.72:12252 |        |                  |                  |
| MyTrainable_a224e_00000 | TERMINATED | 10.20.28.72:11404 |      1 |          3.01514 |              222 |
| MyTrainable_a224e_00001 | TERMINATED | 10.20.28.72:11581 |      1 |          3.01234 |              222 |
| MyTrainable_a224e_00002 | TERMINATED | 10.20.28.72:11729 |      1 |          3.01303 |              222 |
| MyTrainable_a224e_00003 | TERMINATED | 10.20.28.72:11909 |      1 |          3.01475 |              222 |
| MyTrainable_a224e_00004 | TERMINATED | 10.20.28.72:12073 |      1 |          3.01466 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


== Status ==
Current time: 2022-12-27 18:39:16 (running for 00:00:32.11)
Memory usage on this node: 29.2/125.8 GiB 
Using FIFO scheduling algorithm.
Resources requested: 1.0/32 CPUs, 0/4 GPUs, 0.0/68.54 GiB heap, 0.0/33.36 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jychen/ray_results/MyTrainable_2022-12-27_18-38-42
Number of trials: 6/1000 (1 RUNNING, 5 TERMINATED)
+-------------------------+------------+-------------------+--------+------------------+------------------+
| Trial name              | status     | loc               |   iter |   total time (s) |   dummy_accuracy |
|-------------------------+------------+-------------------+--------+------------------+------------------|
| MyTrainable_a224e_00005 | RUNNING    | 10.20.28.72:12252 |        |                  |                  |
| MyTrainable_a224e_00000 | TERMINATED | 10.20.28.72:11404 |      1 |          3.01514 |              222 |
| MyTrainable_a224e_00001 | TERMINATED | 10.20.28.72:11581 |      1 |          3.01234 |              222 |
| MyTrainable_a224e_00002 | TERMINATED | 10.20.28.72:11729 |      1 |          3.01303 |              222 |
| MyTrainable_a224e_00003 | TERMINATED | 10.20.28.72:11909 |      1 |          3.01475 |              222 |
| MyTrainable_a224e_00004 | TERMINATED | 10.20.28.72:12073 |      1 |          3.01466 |              222 |
+-------------------------+------------+-------------------+--------+------------------+------------------+


2022-12-27 18:39:16,767 ERROR tune.py:773 -- Trials did not complete: [MyTrainable_a224e_00005]
2022-12-27 18:39:16,767 INFO tune.py:777 -- Total run time: 33.94 seconds (32.11 seconds for the tuning loop).
2022-12-27 18:39:16,767 WARNING tune.py:783 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`

Issue Severity

High: It blocks me from completing my task.

fzyzcjy avatar Dec 27 '22 10:12 fzyzcjy

Hey @fzyzcjy, thanks for opening this issue!

Could you elaborate more on what you're trying to do? In particular, what are the implications of not being able to run two Tune scripts concurrently (you mentioned it made Tune useless for you)?

bveeramani avatar Dec 27 '22 23:12 bveeramani

@bveeramani Hi thanks for the reply.

Could you elaborate more on what you're trying to do?

I described that in the "Reproduction script" section above, which contains all the code you need to copy-and-paste to reproduce the issue, as well as sample logs.

what are the implications of not being able to run two Tune scripts concurrently (you mentioned it made Tune useless for you)?

IMHO, a Ray cluster should be able to run multiple jobs, such as tuning jobs. Therefore, it is common for people (at least for me) to have more than one Ray Tune script running on the cluster. In the dummy example above, a.py and b.py run concurrently. Naturally, we expect the a.py script to use the code in a.py and the b.py script to use the code in b.py, but the bug makes a.py use the code in b.py.

fzyzcjy avatar Dec 27 '22 23:12 fzyzcjy

Thanks for the info!

I have mentioned in the "reproduction script" section, with all code that you can copy-and-paste to reproduce, as well as sample logs.

Yeah, I ran the reproduction scripts earlier. My question was more about why you wanted to run multiple jobs on a single cluster (for example, are you sharing a cluster with other users?). That said, I agree with you -- it seems like a reasonable thing to do.

Running multiple Tune scripts isn't currently supported. Here's a comment from https://github.com/ray-project/ray/issues/30091 that explains a workaround:

The main problem here again is that Ray Tune uses a global key value store to register both the trainable and its parameters. The workaround here is to rename the trainable (or override the name attribute) though multi tenancy is generally not supported at the moment - i.e. we never test it and we don't guarantee that it works. We hope to refactor some of this logic in the future, but it's lower priority.

bveeramani avatar Dec 28 '22 00:12 bveeramani

My question was more about why you wanted to run multiple jobs on a single cluster (for example, are you sharing a cluster with other users?)

In my case it goes like this: I start Ray Tune on my codebase on day 1 (suppose it takes 5 days to run). Then on day 2, after modifying some source code, I want to run Ray Tune on the new codebase as another experiment, while the previous Ray Tune run is still going.

Running multiple Tune scripts isn't currently supported.

Well, I am quite surprised by that! Could you please explain a bit what workflow I should use if I cannot run multiple Tune scripts?

fzyzcjy avatar Dec 28 '22 00:12 fzyzcjy

The usual workflow is to start a separate Ray cluster for each tuning job. The reason is that two concurrently running Tune runs will compete for resources, which can lead to unexpected behavior, e.g. one Tune run not executing any trials for a long time, or deadlocks where no trials are running at all (usually only in distributed training).

In your case the issue is that Ray Tune uses a global experiment registry that uses the trainable name as the identifier. Thus, your second script globally overwrites the trainable named `MyTrainable`, so the next trials in the first script will use this overwritten trainable.

If you really need to run two experiments with different trainables at the same time, you mainly need to make sure that these trainables have different names. To achieve this, you can do something like

import uuid

# Append a random suffix so each experiment registers a unique name.
MyTrainable.__name__ = "MyTrainable" + str(uuid.uuid4())[:8]

which should resolve the conflicts. You should also set the TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED=1 environment variable if you are running Ray 2.2 or below (this is not needed on the latest master).
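For illustration, a version of a.py with both mitigations applied might look like this (a sketch only; the suffix scheme is arbitrary, and multi-tenancy remains unsupported):

import os
import uuid

# Only needed on Ray 2.2 or below; set it before Tune starts
# (exporting it in the shell before launching works as well).
os.environ["TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED"] = "1"

from ray import tune


class MyTrainable(tune.Trainable):
    def step(self):
        return {"dummy_accuracy": 111, "done": True}


# Give this experiment's trainable a unique registry key so a
# concurrently running script cannot overwrite it.
MyTrainable.__name__ = "MyTrainable" + str(uuid.uuid4())[:8]

tuner = tune.Tuner(
    MyTrainable,
    tune_config=tune.TuneConfig(num_samples=1000, max_concurrent_trials=1),
)
tuner.fit()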

krfricke avatar Dec 29 '22 06:12 krfricke

@krfricke Thank you for the clarification!

fzyzcjy avatar Dec 29 '22 07:12 fzyzcjy

This has also been discussed earlier, e.g. in #18851 or #21329. Why should one use that specific environment variable, @krfricke? What does it do?

thoglu avatar Jan 01 '23 19:01 thoglu

It's also been discussed on Slack, e.g.

  • https://ray-distributed.slack.com/archives/CNECXMW22/p1673356718620379
  • https://ray-distributed.slack.com/archives/CNECXMW22/p1641502218000300

so we may want to see if we can better support this in the future.

To be clear, the environment variable is no longer needed going forward (on the latest master and starting with Ray 2.3).

Previously, we cleaned up existing placement groups with the `__tune__xyz` prefix at the start of every run. This was because placement groups were not properly garbage collected in their original implementation. Additionally, our previous resource management system was prone to resource leaks and had some reconciliation logic.

Thus, without setting the environment variable, existing PGs (even from currently running trials) would be cleaned up, and trials would be preempted. You could set the environment variable to avoid this pre-run cleanup.
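As a sketch, one could gate the flag on the installed Ray version (assuming the third-party `packaging` module is available; the 2.3 cutoff follows the comments above):

import os

import ray
from packaging import version

# Per the discussion above, only Ray 2.2 and below need the flag;
# Ray 2.3+ no longer performs the pre-run placement group cleanup.
if version.parse(ray.__version__) < version.parse("2.3.0"):
    os.environ["TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED"] = "1"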

In #30016 we refactored Tune's resource management so that this reconciliation and cleanup is not needed anymore.

krfricke avatar Jan 10 '23 17:01 krfricke

Related issue with a workaround: https://github.com/ray-project/ray/issues/30091

We'll aim to include a fix in Ray 2.4.

krfricke avatar Feb 15 '23 16:02 krfricke

Hi. Has this issue been resolved in Ray 2.4 or later? I'm trying to run multiple Ray Tune jobs on a single compute cluster at the same time and I'm blocked by this issue. Until now I've been using Ray 2.0.0, but I'll be happy to upgrade if this is fixed.

jeremi-eta avatar Aug 23 '23 19:08 jeremi-eta

This is fixed in 2.4+: https://github.com/ray-project/ray/pull/33095

justinvyu avatar Oct 30 '23 20:10 justinvyu