Workflow: Reading workflow status can lead to corrupted json reads.

Open SebastianMorawiec opened this issue 1 year ago • 1 comments

What happened + What you expected to happen

In very rare cases we have observed exceptions raised by the JSON decoder inside Ray while it reads task outputs.

After digging deep into Ray's code, I noticed that the workflow status handling is causing the trouble.

This is based on code analysis only. Here is the initial stack trace:

  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/workflow/workflow_access.py", line 40, in load_task_output_from_storage
    tid = wf_store.inspect_output(task_id)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/workflow/workflow_storage.py", line 496, in inspect_output
    status = self.load_workflow_status()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/workflow/workflow_storage.py", line 736, in load_workflow_status
    return self._status_storage.load_workflow_status(self._workflow_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/workflow/workflow_storage.py", line 167, in load_workflow_status
    metadata = json.loads(raw_data)
               ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

The issue: https://github.com/ray-project/ray/blob/e23aa6e0b6381a695223daf97616d31572f1940f/python/ray/workflow/workflow_storage.py#L150

This code writes the metadata to a file: https://github.com/ray-project/ray/blob/e23aa6e0b6381a695223daf97616d31572f1940f/python/ray/workflow/workflow_storage.py#L152

A file-based mutex guards these writes to make them transactional, and it marks the status as dirty for the duration of the write: https://github.com/ray-project/ray/blob/e23aa6e0b6381a695223daf97616d31572f1940f/python/ray/workflow/workflow_storage.py#L150

However, the read at https://github.com/ray-project/ray/blob/e23aa6e0b6381a695223daf97616d31572f1940f/python/ray/workflow/workflow_storage.py#L165 does not respect the mutex. It reads the file as-is, so it can pick up a partially written file and try to parse invalid JSON.
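One common way to make a lock-free read safe is to publish the file atomically: write the new content to a temporary file, then rename it over the target. The sketch below is not Ray's code. `write_status_atomically` and `load_status` are hypothetical helpers, and the sketch assumes a local filesystem where `os.replace` is atomic; Ray's storage layer may sit on other filesystems where that guarantee does not hold.

```python
import json
import os
import tempfile


def write_status_atomically(path: str, metadata: dict) -> None:
    """Hypothetical helper: publish JSON so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(json.dumps(metadata))
            tmp.flush()
            os.fsync(tmp.fileno())
        # The rename swaps the file in one step: a concurrent reader sees
        # either the old content or the new content, never a truncated file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_status(path: str) -> dict:
    """Hypothetical reader: needs no lock because the file is always complete."""
    with open(path) as f:
        return json.loads(f.read())
```

With this pattern the reader does not need to know about the writer's mutex at all. The alternative would be to have the read path acquire the same lock as the write path.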

As this bigger stacktrace shows:

  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/_private/worker.py", line 2626, in get
    raise value
ray.exceptions.RaySystemError: System error: ray::load_task_output_from_storage() (pid=2275, ip=10.200.155.95)
  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/workflow/workflow_access.py", line 40, in load_task_output_from_storage
    tid = wf_store.inspect_output(task_id)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/workflow/workflow_storage.py", line 496, in inspect_output
    status = self.load_workflow_status()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/workflow/workflow_storage.py", line 736, in load_workflow_status
    return self._status_storage.load_workflow_status(self._workflow_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/orquestra/venv/lib/python3.11/site-packages/ray/workflow/workflow_storage.py", line 167, in load_workflow_status
    metadata = json.loads(raw_data)
               ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

This status read is performed by a process other than the workflow manager. That means two processes can read and write the same file without any protection, which leads to this issue.

Versions / Dependencies

Ray 2.11, Python 3.11. Happens on Ubuntu 20.04.

Reproduction script

This is a race condition, so I cannot provide a fool-proof repro script. It happens rarely.
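Although the race itself cannot be triggered on demand, the suspected interleaving can be simulated deterministically in a single process. The sketch below does not use Ray at all. It assumes the writer opens the file in truncating mode and that the reader runs between the truncation and the write; `simulate_unprotected_read` and the `status.json` file name are hypothetical.

```python
import json
import os
import tempfile


def simulate_unprotected_read() -> str:
    """Interleave a non-atomic write with a read and return the decoder error."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "status.json")  # hypothetical file name
        with open(path, "w") as f:
            f.write(json.dumps({"status": "RUNNING"}))

        # Writer step 1: reopening in "w" mode truncates the file to zero bytes.
        writer = open(path, "w")
        try:
            # The reader runs before the writer has written anything back.
            with open(path) as reader:
                raw_data = reader.read()
            # Writer step 2: the new content arrives too late for the reader.
            writer.write(json.dumps({"status": "SUCCESSFUL"}))
        finally:
            writer.close()

        try:
            json.loads(raw_data)
        except json.JSONDecodeError as err:
            return str(err)
        return "no error"


print(simulate_unprotected_read())
# prints: Expecting value: line 1 column 1 (char 0)
```

The message is the same one the JSON decoder reports in the stack trace, which is consistent with the reader seeing an empty file. It does not prove that this exact interleaving is what happens inside Ray.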

Issue Severity

Medium: It is a significant difficulty but I can work around it.

SebastianMorawiec, Apr 29 '24 10:04

This is a great catch, @SebastianMorawiec. Would you have time to submit a PR to fix this issue? We can help review and merge it.

anyscalesam, Apr 29 '24 22:04