[Train Log] Ray Train Structured Logging
Why are these changes needed?
This PR adds structured logging for Ray Train. The overall structure follows the implementation of Ray Data's structured logging PR. Main components include:
- python/ray/train/_internal/logging.py: defines the logging utility functions (see the sketch below);
- python/ray/train/_internal/logging.yaml: defines the YAML configuration for how logs are formatted and handled;
- python/ray/train/tests/test_logging.py: provides the corresponding unit tests for the logging utilities.
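For reference, a minimal sketch of how the utilities in logging.py could load the YAML config and switch encodings via RAY_TRAIN_LOG_ENCODING. The function name, default config path, and the "ray_json" formatter name are illustrative assumptions, not necessarily what this PR implements:

import logging.config
import os

import yaml  # PyYAML, assumed available since the config ships as a YAML file

ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"  # "TEXT" (default) or "JSON"


def configure_logging(config_path: str = "logging.yaml") -> None:
    # Load the YAML logging config and apply it via the stdlib dictConfig.
    with open(config_path) as f:
        config = yaml.safe_load(f)
    if os.environ.get(ENCODING_ENV, "TEXT").upper() == "JSON":
        # Point every handler at the JSON formatter defined in the YAML file;
        # the formatter name "ray_json" is an assumption for this sketch.
        for handler in config.get("handlers", {}).values():
            handler["formatter"] = "ray_json"
    logging.config.dictConfig(config)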
Example
Code snippet:
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train._internal.logging import get_log_directory

ray.init()


def train_func_per_worker():
    pass


def train_dummy(num_workers=2):
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=False)
    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        scaling_config=scaling_config,
    )
    result = trainer.fit()
    print(f"Training result: {result}")


if __name__ == "__main__":
    print(f"Log directory: {get_log_directory()}")
    train_dummy()
JSON Logging
RAY_TRAIN_LOG_ENCODING="JSON" python log.py
{"asctime": "2024-10-07 15:30:50,357", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=None):\nStorageContext<\n storage_filesystem='local',\n storage_fs_path='/Users/hpguo/ray_results',\n experiment_dir_name='TorchTrainer_2024-10-07_15-30-48',\n trial_dir_name='TorchTrainer_cd309_00000_0_2024-10-07_15-30-48',\n current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "ffffffffffffffff7034f478a5e16029a5f8624b01000000"}
{"asctime": "2024-10-07 15:30:50,357", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-10-07_15-30-47_254188_77655/artifacts/2024-10-07_15-30-48/TorchTrainer_2024-10-07_15-30-48/working_dirs/TorchTrainer_cd309_00000_0_2024-10-07_15-30-48", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "ffffffffffffffff7034f478a5e16029a5f8624b01000000"}
{"asctime": "2024-10-07 15:30:50,386", "levelname": "DEBUG", "message": "Starting 2 workers.", "filename": "worker_group.py", "lineno": 202, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:51,071", "levelname": "DEBUG", "message": "2 workers have successfully started.", "filename": "worker_group.py", "lineno": 204, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:51,918", "levelname": "DEBUG", "message": "Setting up process group for: env:// [rank=1, world_size=2]", "filename": "config.py", "lineno": 88, "job_id": "01000000", "worker_id": "9836e987f49f30e1380b4186bc857de51037f838a2743d14ad20e687", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "75cacda28ffb9670fecced0801000000", "task_id": "16d949af2e9883d575cacda28ffb9670fecced0801000000"}
{"asctime": "2024-10-07 15:30:51,918", "levelname": "INFO", "message": "Setting up process group for: env:// [rank=0, world_size=2]", "filename": "config.py", "lineno": 83, "job_id": "01000000", "worker_id": "7071ecfe007ca0d51501ea88f34914f93557a989a8d0f9c1b8354ae6", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "5d40febd826c5668608ec28601000000", "task_id": "2f0b149c8b9a04bd5d40febd826c5668608ec28601000000"}
{"asctime": "2024-10-07 15:30:51,918", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "9836e987f49f30e1380b4186bc857de51037f838a2743d14ad20e687", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "75cacda28ffb9670fecced0801000000", "task_id": "16d949af2e9883d575cacda28ffb9670fecced0801000000"}
{"asctime": "2024-10-07 15:30:51,918", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "7071ecfe007ca0d51501ea88f34914f93557a989a8d0f9c1b8354ae6", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "5d40febd826c5668608ec28601000000", "task_id": "2f0b149c8b9a04bd5d40febd826c5668608ec28601000000"}
{"asctime": "2024-10-07 15:30:51,954", "levelname": "INFO", "message": "Started distributed worker processes: \n- (node_id=08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3, ip=127.0.0.1, pid=77706) world_rank=0, local_rank=0, node_rank=0\n- (node_id=08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3, ip=127.0.0.1, pid=77707) world_rank=1, local_rank=1, node_rank=0", "filename": "backend_executor.py", "lineno": 447, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:51,983", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=1):\nStorageContext<\n storage_filesystem='local',\n storage_fs_path='/Users/hpguo/ray_results',\n experiment_dir_name='TorchTrainer_2024-10-07_15-30-48',\n trial_dir_name='TorchTrainer_cd309_00000_0_2024-10-07_15-30-48',\n current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "9836e987f49f30e1380b4186bc857de51037f838a2743d14ad20e687", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "75cacda28ffb9670fecced0801000000", "task_id": "bc0ded8c3bd30ccb75cacda28ffb9670fecced0801000000"}
{"asctime": "2024-10-07 15:30:51,983", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=0):\nStorageContext<\n storage_filesystem='local',\n storage_fs_path='/Users/hpguo/ray_results',\n experiment_dir_name='TorchTrainer_2024-10-07_15-30-48',\n trial_dir_name='TorchTrainer_cd309_00000_0_2024-10-07_15-30-48',\n current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "7071ecfe007ca0d51501ea88f34914f93557a989a8d0f9c1b8354ae6", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "5d40febd826c5668608ec28601000000", "task_id": "1a908c376b4739e75d40febd826c5668608ec28601000000"}
{"asctime": "2024-10-07 15:30:51,983", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-10-07_15-30-47_254188_77655/artifacts/2024-10-07_15-30-48/TorchTrainer_2024-10-07_15-30-48/working_dirs/TorchTrainer_cd309_00000_0_2024-10-07_15-30-48", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "9836e987f49f30e1380b4186bc857de51037f838a2743d14ad20e687", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "75cacda28ffb9670fecced0801000000", "task_id": "bc0ded8c3bd30ccb75cacda28ffb9670fecced0801000000"}
{"asctime": "2024-10-07 15:30:51,983", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-10-07_15-30-47_254188_77655/artifacts/2024-10-07_15-30-48/TorchTrainer_2024-10-07_15-30-48/working_dirs/TorchTrainer_cd309_00000_0_2024-10-07_15-30-48", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "7071ecfe007ca0d51501ea88f34914f93557a989a8d0f9c1b8354ae6", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "5d40febd826c5668608ec28601000000", "task_id": "1a908c376b4739e75d40febd826c5668608ec28601000000"}
{"asctime": "2024-10-07 15:30:52,996", "levelname": "DEBUG", "message": "Shutting down 2 workers.", "filename": "worker_group.py", "lineno": 216, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:53,007", "levelname": "DEBUG", "message": "Graceful termination failed. Falling back to force kill.", "filename": "worker_group.py", "lineno": 225, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
{"asctime": "2024-10-07 15:30:53,007", "levelname": "DEBUG", "message": "Shutdown successful.", "filename": "worker_group.py", "lineno": 230, "job_id": "01000000", "worker_id": "14046ce50d4da0d524f48a88d5c107d616a7b4b38109eb05b464cc89", "node_id": "08c812bbed27258cd6c0ab8c4a1a450a1f3270c29feda5a5145f37d3", "actor_id": "7034f478a5e16029a5f8624b01000000", "task_id": "e3e06b97f47ff84e356227242797cb266a4ee92401000000"}
Text Logging
RAY_TRAIN_LOG_ENCODING="TEXT" python log.py
2024-09-24 15:06:02,274 DEBUG session.py:154 -- StorageContext on SESSION (rank=None):
StorageContext<
storage_filesystem='local',
storage_fs_path='/Users/hpguo/ray_results',
experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
current_checkpoint_index=-1,
>
2024-09-24 15:06:02,274 DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:02,286 DEBUG worker_group.py:202 -- Starting 2 workers.
2024-09-24 15:06:02,973 DEBUG worker_group.py:204 -- 2 workers have successfully started.
2024-09-24 15:06:03,812 INFO config.py:83 -- Setting up process group for: env:// [rank=0, world_size=2]
2024-09-24 15:06:03,812 DEBUG config.py:88 -- Setting up process group for: env:// [rank=1, world_size=2]
2024-09-24 15:06:03,812 DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,812 DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,863 INFO backend_executor.py:447 -- Started distributed worker processes:
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31968) world_rank=0, local_rank=0, node_rank=0
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31969) world_rank=1, local_rank=1, node_rank=0
2024-09-24 15:06:03,893 DEBUG session.py:154 -- StorageContext on SESSION (rank=0):
StorageContext<
storage_filesystem='local',
storage_fs_path='/Users/hpguo/ray_results',
experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893 DEBUG session.py:154 -- StorageContext on SESSION (rank=1):
StorageContext<
storage_filesystem='local',
storage_fs_path='/Users/hpguo/ray_results',
experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893 DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:03,893 DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:04,906 DEBUG worker_group.py:216 -- Shutting down 2 workers.
2024-09-24 15:06:04,912 DEBUG worker_group.py:225 -- Graceful termination failed. Falling back to force kill.
2024-09-24 15:06:04,912 DEBUG worker_group.py:230 -- Shutdown successful.
Related issue number
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [ ] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
@omatthew98 @hongpeng-guo One other question I had: why is the configuration via an environment variable rather than a config somewhere in the API? Is adding a config in the API (e.g., DataContext / ray.train.RunConfig) a future plan?
That is a fair question; I was mostly going off what we already had in place, which already used environment variables. I think there might be some argument for using environment variables to ensure logging is configured as early as possible (e.g., at module initialization, before a DataContext or train.RunConfig might exist), but I'm not sure if that is the case.
Just a heads up: based on this thread, we are going to move our YAML configurations to Python dictionary configurations.
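For illustration, the same configuration expressed as a plain Python dict passed to logging.config.dictConfig might look roughly like the sketch below; the formatter name, handler layout, log file name, and log levels are assumptions, not the exact values that will land:

import logging.config

TRAIN_LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "ray_train_text": {
            "format": "%(asctime)s %(levelname)s %(filename)s:%(lineno)d -- %(message)s",
        },
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "formatter": "ray_train_text",
            "filename": "ray-train.log",  # placeholder path for this sketch
        },
    },
    "loggers": {
        "ray.train": {"level": "DEBUG", "handlers": ["file"], "propagate": False},
    },
}

logging.config.dictConfig(TRAIN_LOGGING_CONFIG)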
Comments handled; this is good for another round of review.
Can we include local, world, and node rank as part of the structure?
Sure, I am thinking about adding more Train-only context in a follow-up PR. We also need to differentiate between driver, controller, and worker processes; all the rank-related concepts are only defined on worker processes.
Update: I added a TrainContextFilter that appends rank information (world_rank, local_rank, world_size, local_world_size, and node_rank) to the structured logging records when the log is emitted from a train worker.
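Roughly, such a filter could look like the sketch below, assuming ray.train.get_context() for the rank lookups; the exact attribute names and the fallback behavior for non-worker processes are assumptions:

import logging

import ray.train


class TrainContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        try:
            ctx = ray.train.get_context()
            record.world_rank = ctx.get_world_rank()
            record.local_rank = ctx.get_local_rank()
            record.world_size = ctx.get_world_size()
            record.local_world_size = ctx.get_local_world_size()
            record.node_rank = ctx.get_node_rank()
        except Exception:
            # Driver and controller processes have no rank concepts;
            # leave the record unannotated.
            pass
        return True  # only annotate records, never drop them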
cc @alanwguo
Updated the output in the PR description.
Note: In the current implementation, we cannot attach the world_rank / run_id information to every single piece of the train worker logs. The worker processes are created first and only then assigned their rank information, so any log emitted before an actor obtains its train session cannot include the rank info or the run_id. Most logs do get the train context info attached correctly.
In Train V2, the implementation will be a bit different. We will fix the above issue there, since V2 has less coupling with Tune, which makes it easier to solve.
Closing this one; pivoting to the implementation in V2.