ray icon indicating copy to clipboard operation
ray copied to clipboard

[core] Don't try to monitor zipped files

Open dayshah opened this issue 7 months ago • 5 comments

Why are these changes needed?

Some users may implement their own log rotation on autoscaler event files. If zipping the files during rotation, the event agent will still try to read zipped files and raise an exception causing that file monitor to fail. We try to open the files every 0.1 seconds. We only need to look for new events in .log files that are actively being written to.

This pr prevents opening and looking for changes in zipped files. There will be no further changes in these files, so we don't need to monitor them. This can lead to exceptions like

Exception: Read event file failed: /tmp/ray/session_2025-04-30_09-40-39_975641_2270/logs/events/event_AUTOSCALER.log.2.gz
    raise Exception(f"Read event file failed: {file}") from e
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/modules/event/event_utils.py", line 176, in 
_read_monitor_file

Related issue number

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

dayshah avatar May 19 '25 22:05 dayshah

Lol so many questions:

  • How was this discovered?
  • Which component is monitoring these files and why?

israbbani avatar May 20 '25 00:05 israbbani

Lol so many questions:

  • How was this discovered?
  • Which component is monitoring these files and why?

added some extra context, discovered by looking at some user dashboard logs from a long running cluster

dayshah avatar May 20 '25 16:05 dayshah

As discussed, let's also mention the long term fix here.

jjyao avatar May 20 '25 18:05 jjyao

As discussed, let's also mention the long term fix here.

@edoakes @israbbani @MengjinYan Long term we'll kill this event agent / head entirely with the new observability stack, and Ray will stop writing events to files. Any events should follow a similar path to the new task events. An rpc with events should be sent to the gcs from the component emitting the events and the events will eventually get persisted in some storage, users can plug-in their own storage to persist.

dayshah avatar May 20 '25 19:05 dayshah

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jun 17 '25 00:06 github-actions[bot]

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jul 03 '25 12:07 github-actions[bot]

@edoakes

dayshah avatar Jul 03 '25 22:07 dayshah