
[Train Log] Ray Train Structured Logging

Open hongpeng-guo opened this pull request 1 year ago • 2 comments

Why are these changes needed?

This PR adds structured logging for Ray Train. The overall structure follows the implementation of Ray Data's structured logging PR. The main components are listed below; an illustrative configuration sketch follows the list.

  • python/ray/train/_internal/logging.py: defines the logging utility functions;
  • python/ray/train/_internal/logging.yaml: defines the YAML configuration for how logs are formatted and handled;
  • python/ray/train/tests/test_logging.py: provides the corresponding unit tests for the logging utilities.
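
As a rough illustration of what these pieces do together, here is a minimal Python sketch; it is not the PR's actual logging.yaml or logging.py, and the log file name and formatter details are assumptions. The idea is simply to pick a JSON or plain-text formatter based on the RAY_TRAIN_LOG_ENCODING environment variable:

import json
import logging
import logging.config
import os


class _JSONFormatter(logging.Formatter):
    """Toy JSON formatter; the real one also attaches job/worker/actor IDs."""

    def format(self, record):
        return json.dumps({
            "asctime": self.formatTime(record),
            "levelname": record.levelname,
            "message": record.getMessage(),
            "filename": record.filename,
            "lineno": record.lineno,
        })


# Choose the encoding the same way the PR does: via an environment variable.
encoding = os.environ.get("RAY_TRAIN_LOG_ENCODING", "TEXT").upper()

logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "json": {"()": _JSONFormatter},
        "text": {"format": "%(asctime)s\t%(levelname)s %(filename)s:%(lineno)d -- %(message)s"},
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "filename": "ray-train.log",  # hypothetical name, for illustration only
            "formatter": "json" if encoding == "JSON" else "text",
        },
    },
    "loggers": {
        "ray.train": {"level": "DEBUG", "handlers": ["file"], "propagate": False},
    },
})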

Example

Code snippet:

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train._internal.logging import get_log_directory

ray.init()

def train_func_per_worker():
    pass

def train_dummy(num_workers=2):
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=False)

    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")

if __name__ == "__main__":
    print(f"Log directory: {get_log_directory()}")
    train_dummy()

JSON Logging

RAY_TRAIN_LOG_ENCODING="JSON" python log.py
{"asctime": "2024-10-07 15:30:50,357", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=None):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-10-07_15-30-48',\n  trial_dir_name='TorchTrainer_cd309_00000_0_2024-10-07_15-30-48',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "ffffffffffffffff7034f478a5e16029a5f8624b01000000"}
{"asctime": "2024-10-07 15:30:50,357", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-10-07_15-30-47_254188_77655/artifacts/2024-10-07_15-30-48/TorchTrainer_2024-10-07_15-30-48/working_dirs/TorchTrainer_cd309_00000_0_2024-10-07_15-30-48", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "ffffffffffffffff7034f478a5e16029a5f8624b01000000"}
{"asctime": "2024-10-07 15:30:50,386", "levelname": "DEBUG", "message": "Starting 2 workers.", "filename": "worker_group.py", "lineno": 202, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:51,071", "levelname": "DEBUG", "message": "2 workers have successfully started.", "filename": "worker_group.py", "lineno": 204, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:51,918", "levelname": "DEBUG", "message": "Setting up process group for: env:// [rank=1, world_size=2]", "filename": "config.py", "lineno": 88, "job_id": "01000000", "worker_id": "9836e987f49f30e1380b4186bc857de51037f838a2743d14ad20e687", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "75cacda28ffb9670fecced0801000000", "task_id": "16d949af2e9883d575cacda28ffb9670fecced0801000000"}
{"asctime": "2024-10-07 15:30:51,918", "levelname": "INFO", "message": "Setting up process group for: env:// [rank=0, world_size=2]", "filename": "config.py", "lineno": 83, "job_id": "01000000", "worker_id": "7071ecfe007ca0d51501ea88f34914f93557a989a8d0f9c1b8354ae6", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "5d40febd826c5668608ec28601000000", "task_id": "2f0b149c8b9a04bd5d40febd826c5668608ec28601000000"}
{"asctime": "2024-10-07 15:30:51,918", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "9836e987f49f30e1380b4186bc857de51037f838a2743d14ad20e687", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "75cacda28ffb9670fecced0801000000", "task_id": "16d949af2e9883d575cacda28ffb9670fecced0801000000"}
{"asctime": "2024-10-07 15:30:51,918", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "7071ecfe007ca0d51501ea88f34914f93557a989a8d0f9c1b8354ae6", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "5d40febd826c5668608ec28601000000", "task_id": "2f0b149c8b9a04bd5d40febd826c5668608ec28601000000"}
{"asctime": "2024-10-07 15:30:51,954", "levelname": "INFO", "message": "Started distributed worker processes: \n- (node_id=08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3, ip=127.0.0.1, pid=77706) world_rank=0, local_rank=0, node_rank=0\n- (node_id=08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3, ip=127.0.0.1, pid=77707) world_rank=1, local_rank=1, node_rank=0", "filename": "backend_executor.py", "lineno": 447, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:51,983", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=1):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-10-07_15-30-48',\n  trial_dir_name='TorchTrainer_cd309_00000_0_2024-10-07_15-30-48',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "9836e987f49f30e1380b4186bc857de51037f838a2743d14ad20e687", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "75cacda28ffb9670fecced0801000000", "task_id": "bc0ded8c3bd30ccb75cacda28ffb9670fecced0801000000"}
{"asctime": "2024-10-07 15:30:51,983", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=0):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-10-07_15-30-48',\n  trial_dir_name='TorchTrainer_cd309_00000_0_2024-10-07_15-30-48',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "7071ecfe007ca0d51501ea88f34914f93557a989a8d0f9c1b8354ae6", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "5d40febd826c5668608ec28601000000", "task_id": "1a908c376b4739e75d40febd826c5668608ec28601000000"}
{"asctime": "2024-10-07 15:30:51,983", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-10-07_15-30-47_254188_77655/artifacts/2024-10-07_15-30-48/TorchTrainer_2024-10-07_15-30-48/working_dirs/TorchTrainer_cd309_00000_0_2024-10-07_15-30-48", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "9836e987f49f30e1380b4186bc857de51037f838a2743d14ad20e687", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "75cacda28ffb9670fecced0801000000", "task_id": "bc0ded8c3bd30ccb75cacda28ffb9670fecced0801000000"}
{"asctime": "2024-10-07 15:30:51,983", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-10-07_15-30-47_254188_77655/artifacts/2024-10-07_15-30-48/TorchTrainer_2024-10-07_15-30-48/working_dirs/TorchTrainer_cd309_00000_0_2024-10-07_15-30-48", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "7071ecfe007ca0d51501ea88f34914f93557a989a8d0f9c1b8354ae6", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "5d40febd826c5668608ec28601000000", "task_id": "1a908c376b4739e75d40febd826c5668608ec28601000000"}
{"asctime": "2024-10-07 15:30:52,996", "levelname": "DEBUG", "message": "Shutting down 2 workers.", "filename": "worker_group.py", "lineno": 216, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:53,007", "levelname": "DEBUG", "message": "Graceful termination failed. Falling back to force kill.", "filename": "worker_group.py", "lineno": 225, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:53,007", "levelname": "DEBUG", "message": "Shutdown successful.", "filename": "worker_group.py", "lineno": 230, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}

Text Logging

RAY_TRAIN_LOG_ENCODING="TEXT" python log.py
2024-09-24 15:06:02,274	DEBUG session.py:154 -- StorageContext on SESSION (rank=None):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:02,274	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:02,286	DEBUG worker_group.py:202 -- Starting 2 workers.
2024-09-24 15:06:02,973	DEBUG worker_group.py:204 -- 2 workers have successfully started.
2024-09-24 15:06:03,812	INFO config.py:83 -- Setting up process group for: env:// [rank=0, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:88 -- Setting up process group for: env:// [rank=1, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,863	INFO backend_executor.py:447 -- Started distributed worker processes: 
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31968) world_rank=0, local_rank=0, node_rank=0
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31969) world_rank=1, local_rank=1, node_rank=0
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=0):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=1):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:04,906	DEBUG worker_group.py:216 -- Shutting down 2 workers.
2024-09-24 15:06:04,912	DEBUG worker_group.py:225 -- Graceful termination failed. Falling back to force kill.
2024-09-24 15:06:04,912	DEBUG worker_group.py:230 -- Shutdown successful.

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

hongpeng-guo avatar Sep 24 '24 06:09 hongpeng-guo

@omatthew98 @hongpeng-guo One other question I had: Why is the configuration via an environment variable rather than a config somewhere in the API? Is adding a config in the API (ex: DataContext / ray.train.RunConfig) a future plan?

That is a fair question. I was mostly going off what we already had in place, which used environment variables. There might be an argument for using environment variables to ensure logging is configured as early as possible (e.g., on module initialization, before a DataContext or train.RunConfig might exist), but I'm not sure if that is the case.

omatthew98 avatar Oct 17 '24 20:10 omatthew98

Just a heads up: based on this thread, we are going to move our YAML configurations to Python dictionary configurations.

omatthew98 avatar Oct 17 '24 22:10 omatthew98

Comments addressed; this is ready for another round of review.

hongpeng-guo avatar Oct 28 '24 05:10 hongpeng-guo

Can we include local, world, and node rank as part of the structure?

alanwguo avatar Oct 30 '24 21:10 alanwguo

Can we include local, world, and node rank as part of the structure?

Sure, I am thinking about adding more Train-only context in a follow-up PR. We also need to differentiate between driver, controller, and worker processes; all the rank-related concepts are only defined on worker processes.

hongpeng-guo avatar Oct 31 '24 00:10 hongpeng-guo

Update: I added a TrainContextFilter that appends rank information (world_rank, local_rank, world_size, local_world_size, and node_rank) to the structured log records when the log is emitted from a Train worker. cc @alanwguo
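
For illustration, a filter along these lines can be sketched with the standard logging module. This is a hedged sketch, not the PR's actual TrainContextFilter: in the real implementation the rank information would presumably come from the worker's train session rather than a constructor argument.

import logging
from typing import Optional


class TrainContextFilter(logging.Filter):
    """Sketch: annotate every record it sees with Train worker rank info."""

    def __init__(self, rank_info: Optional[dict] = None):
        super().__init__()
        # None on the driver/controller, where ranks are undefined.
        self._rank_info = rank_info or {}

    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in self._rank_info.items():
            setattr(record, key, value)
        return True  # never drop a record; the filter only annotates it


# Usage sketch on a worker process with world_rank 0 of 2:
logging.getLogger("ray.train").addFilter(TrainContextFilter({
    "world_rank": 0, "local_rank": 0, "world_size": 2,
    "local_world_size": 2, "node_rank": 0,
}))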

hongpeng-guo avatar Nov 01 '24 21:11 hongpeng-guo

Updated the output in the PR description.

Note: In the current implementation, we cannot attach the world_rank / run_id information to every single train worker log. The worker processes are created first and only assigned rank information afterwards, so any log emitted before an actor gets its train_session cannot obtain the rank info or the run_id. Most logs, however, are correctly annotated with the train context info.

In Train V2, the implementation will be a bit different. We will fix the above issue there, as V2 has less coupling with Tune, which makes it easier to solve.

hongpeng-guo avatar Nov 09 '24 12:11 hongpeng-guo

Closing this one. Pivoting to an implementation in V2.

hongpeng-guo avatar Dec 19 '24 22:12 hongpeng-guo