UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte [Ray component: tune]

Open Aricept094 opened this issue 1 year ago • 2 comments

What happened + What you expected to happen

I've recently started experiencing the exact same issue as https://github.com/wandb/wandb/issues/7683#issue-2308957793, which I suspect is related to a recent NVIDIA driver update. Yesterday, I updated my NVIDIA driver to version 555.85, and since then I've been encountering errors in both Ray Tune and wandb.

Initially, I encountered the error in Ray Tune, but after modifying the nvidia_gpu.py file in python3.11/site-packages/ray/_private/accelerators/ to use Latin-1 encoding instead of UTF-8, I was able to get my Ray Tune project working again. The modified code is as follows:

try:
    pynvml.nvmlInit()
except pynvml.NVMLError:
    return None  # pynvml init failed
device_count = pynvml.nvmlDeviceGetCount()
cuda_device_type = None
if device_count > 0:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    device_name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(device_name, bytes):
        device_name = device_name.decode("latin1")  # Changed from "utf-8" to "latin1"
    cuda_device_type = (
        NvidiaGPUAcceleratorManager._gpu_name_to_accelerator_type(device_name)
    )
pynvml.nvmlShutdown()
return cuda_device_type

However, I'm still experiencing issues with W&B, where I'm receiving errors and my metrics are not being monitored as intended.
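A note on why the Latin-1 change stops the crash: Latin-1 assigns a character to every byte value, so decoding can never fail, whereas 0xf8 can never start a UTF-8 sequence. The sketch below illustrates this with made-up bytes (only the leading 0xf8 comes from the traceback; the rest of `raw` is hypothetical). It suggests the workaround silences the error rather than recovering the real GPU name.

```python
import re

# Same pattern as NVIDIA_GPU_NAME_PATTERN in ray/_private/accelerators/nvidia_gpu.py.
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")

# Hypothetical payload: only the leading 0xf8 is taken from the traceback.
raw = b"\xf8\x00\x01\x02"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # 0xf8 is never a valid UTF-8 start byte

# Latin-1 maps every byte to a character, so this cannot raise.
latin1_name = raw.decode("latin1")

# The decoded text is not a real device name, so no accelerator type is found.
match = NVIDIA_GPU_NAME_PATTERN.match(latin1_name)
accelerator_type = match.group(1) if match else None

print(utf8_error)
print(repr(latin1_name), accelerator_type)
```

With these placeholder bytes the accelerator type comes out as `None`, so Ray would start but without a detected GPU type. What the affected driver really returns may differ.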

Versions / Dependencies

OS: Windows (WSL2)

Python version: 3.11.8

ray : 2.22.0

nvidia driver version ( installed on windows ) : 555.85

Reproduction script

  File "/home/aricept094/MY_Scripts/Ray7_pl_augment_new.py", line 1629, in <module>
    analysis = tune.run(
               ^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/tune/tune.py", line 527, in run
    _ray_auto_init(entrypoint=error_message_map["entrypoint"])
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/tune/tune.py", line 252, in _ray_auto_init
    ray.init()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/worker.py", line 1642, in init
    _global_node = ray._private.node.Node(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 336, in __init__
    self.start_ray_processes()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 1396, in start_ray_processes
    resource_spec = self.get_resource_spec()
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 580, in get_resource_spec
    ).resolve(is_head=self.head, node_ip_address=self.node_ip_address)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/resource_spec.py", line 215, in resolve
    accelerator_manager.get_current_node_accelerator_type()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/accelerators/nvidia_gpu.py", line 71, in get_current_node_accelerator_type
    device_name = device_name.decode("utf-8")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

import re
import os
import logging
from typing import Optional, List, Tuple

from ray._private.accelerators.accelerator import AcceleratorManager

logger = logging.getLogger(__name__)

CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR = "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"

NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


class NvidiaGPUAcceleratorManager(AcceleratorManager):
    """Nvidia GPU accelerators."""

    @staticmethod
    def get_resource_name() -> str:
        return "GPU"

    @staticmethod
    def get_visible_accelerator_ids_env_var() -> str:
        return CUDA_VISIBLE_DEVICES_ENV_VAR

    @staticmethod
    def get_current_process_visible_accelerator_ids() -> Optional[List[str]]:
        cuda_visible_devices = os.environ.get(
            NvidiaGPUAcceleratorManager.get_visible_accelerator_ids_env_var(), None
        )
        if cuda_visible_devices is None:
            return None

        if cuda_visible_devices == "":
            return []

        if cuda_visible_devices == "NoDevFiles":
            return []

        return list(cuda_visible_devices.split(","))

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        import ray._private.thirdparty.pynvml as pynvml

        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError:
            return 0  # pynvml init failed
        device_count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return device_count

    @staticmethod
    def get_current_node_accelerator_type() -> Optional[str]:
        import ray._private.thirdparty.pynvml as pynvml

        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError:
            return None  # pynvml init failed
        device_count = pynvml.nvmlDeviceGetCount()
        cuda_device_type = None
        if device_count > 0:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            device_name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(device_name, bytes):
                device_name = device_name.decode("utf-8")
            cuda_device_type = (
                NvidiaGPUAcceleratorManager._gpu_name_to_accelerator_type(device_name)
            )
        pynvml.nvmlShutdown()
        return cuda_device_type

    @staticmethod
    def _gpu_name_to_accelerator_type(name):
        if name is None:
            return None
        match = NVIDIA_GPU_NAME_PATTERN.match(name)
        return match.group(1) if match else None

    @staticmethod
    def validate_resource_request_quantity(
        quantity: float,
    ) -> Tuple[bool, Optional[str]]:
        return (True, None)

    @staticmethod
    def set_current_process_visible_accelerator_ids(
        visible_cuda_devices: List[str],
    ) -> None:
        if os.environ.get(NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR):
            return

        os.environ[
            NvidiaGPUAcceleratorManager.get_visible_accelerator_ids_env_var()
        ] = ",".join([str(i) for i in visible_cuda_devices])

    @staticmethod
    def get_ec2_instance_num_accelerators(
        instance_type: str, instances: dict
    ) -> Optional[int]:
        if instance_type not in instances:
            return None

        gpus = instances[instance_type].get("GpuInfo", {}).get("Gpus")
        if gpus is not None:
            # TODO(ameer): currently we support one gpu type per node.
            assert len(gpus) == 1
            return gpus[0]["Count"]
        return None

    @staticmethod
    def get_ec2_instance_accelerator_type(
        instance_type: str, instances: dict
    ) -> Optional[str]:
        if instance_type not in instances:
            return None

        gpus = instances[instance_type].get("GpuInfo", {}).get("Gpus")
        if gpus is not None:
            # TODO(ameer): currently we support one gpu type per node.
            assert len(gpus) == 1
            return gpus[0]["Name"]
        return None
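To show what the decoded name is used for, here is a small standalone sketch of the `_gpu_name_to_accelerator_type` logic from the listing above. The device names are illustrative inputs chosen for this sketch, not output from the reporter's machine, and `gpu_name_to_accelerator_type` is a free-function copy made only for demonstration.

```python
import re
from typing import Optional

# Same pattern as in the listing: a leading word, whitespace, then the model token.
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


def gpu_name_to_accelerator_type(name: Optional[str]) -> Optional[str]:
    """Standalone copy of the static method, for illustration only."""
    if name is None:
        return None
    match = NVIDIA_GPU_NAME_PATTERN.match(name)
    return match.group(1) if match else None


# Illustrative names; the capture group stops at the first character
# that is not an uppercase letter or a digit (here the hyphen).
for name in ("Tesla V100-SXM2-16GB", "NVIDIA A100-SXM4-40GB", None):
    print(name, "->", gpu_name_to_accelerator_type(name))
```

Because the pattern expects readable text, a name that is not valid text never reaches this step under the UTF-8 decode, which is where the traceback above stops.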

Issue Severity

High: It blocks me from completing my task.

Aricept094 avatar May 22 '24 08:05 Aricept094

Experiencing the same issue in WSL2 instance after updating to the latest NVIDIA driver 555.

Windows: 11
Ubuntu (WSL2): 22.04
Python: 3.10.12
Ray: 2.20.0

import ray
ray.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 1642, in init
    _global_node = ray._private.node.Node(
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 336, in __init__
    self.start_ray_processes()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 1396, in start_ray_processes
    resource_spec = self.get_resource_spec()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 571, in get_resource_spec
    self._resource_spec = ResourceSpec(
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/resource_spec.py", line 215, in resolve
    accelerator_manager.get_current_node_accelerator_type()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/accelerators/nvidia_gpu.py", line 71, in get_current_node_accelerator_type
    device_name = device_name.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Yatagarasu50469 avatar May 27 '24 07:05 Yatagarasu50469

A temporary fix is to roll back the NVIDIA driver to 552.44; that worked for me.

Aricept094 avatar May 27 '24 08:05 Aricept094

Indeed, did that right after posting, as a temporary workaround; not really a 'good' long-term solution moving forward though.

Yatagarasu50469 avatar May 27 '24 19:05 Yatagarasu50469

Indeed, did that right after posting, as a temporary workaround; not really a 'good' long-term solution moving forward though.

Agreed. My workaround from the original post was fairly easy, but it only fixes Ray; wandb, for example, was still affected and didn't work as intended.

Aricept094 avatar May 28 '24 05:05 Aricept094

Dask has the same issue: https://github.com/dask/distributed/issues/5768

rynewang avatar May 28 '24 21:05 rynewang

The code that produces device_name comes from pynvml; see https://github.com/gpuopenanalytics/pynvml/issues/53. Can one of the people with the error try this smaller reproducer and report the results? I think this only happens on machines with non-English user interfaces.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print( [x for x in pynvml.nvmlDeviceGetName(handle)])

Edit: add nvmlInit, improve output to show characters

mattip avatar May 29 '24 04:05 mattip

It seems to be a problem with the new driver version, but only on WSL2. I commented on the upstream pynvml issue gpuopenanalytics/pynvml#53.

mattip avatar May 29 '24 15:05 mattip

It seems this is a known problem and will be resolved in a new driver release (from gpuopenanalytics/pynvml#53)

This issue has been escalated to the NVML team and the fix has been merged into the upcoming r560 driver branch. I do not believe there are plans to re-release the short-lived r555 branch.

mattip avatar May 30 '24 13:05 mattip

Here is my output. Maybe it helps.

>>> import pynvml
>>> pynvml.nvmlInit()
>>> handle = pynvml.nvmlDeviceGetHandleByIndex(0)
>>> print( [x for x in pynvml.nvmlDeviceGetName(handle)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mk/.local/lib/python3.10/site-packages/pynvml.py", line 1921, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

MKLepium avatar Jun 09 '24 21:06 MKLepium

A temporary fix is to roll back the NVIDIA driver to 552.44; that worked for me.

I'm having the same issue when using ray.data.read_csv. Can you please provide instructions on how to roll back to 552.44? Thanks in advance.

wxie2013 avatar Jun 14 '24 20:06 wxie2013

Turns out you can just download 552.44 from https://www.nvidia.com/download/driverResults.aspx/224484/en-us/ and install it on Windows.

wxie2013 avatar Jun 14 '24 21:06 wxie2013

Just updated to 560.70. Issue seems to be resolved!

MKLepium avatar Jul 19 '24 01:07 MKLepium

Thanks for the update. Closing. Please open a new issue if the problem is not solved with information from the newer drivers.

mattip avatar Jul 19 '24 05:07 mattip
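For anyone landing here later, a rough sketch of checking whether a machine is on the driver branch reported in this thread (555 on WSL2 was affected; 552.44, 560.70 and 566.14 were reported as working). `driver_branch`, `query_driver_version` and `AFFECTED_BRANCHES` are hypothetical names made up for this sketch. It assumes `nvidia-smi` is on the PATH, and it is an assumption that `nvidia-smi` itself reports the version correctly on the affected driver.

```python
import subprocess
from typing import Optional

# Branches reported as affected in this thread (on WSL2 only).
AFFECTED_BRANCHES = {555}


def driver_branch(version: str) -> Optional[int]:
    """Return the major branch of a version string such as '555.85'."""
    head = version.strip().split(".", 1)[0]
    return int(head) if head.isdigit() else None


def query_driver_version() -> Optional[str]:
    """Ask nvidia-smi for the driver version; None if it cannot be queried."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            encoding="utf-8",
            errors="replace",  # never raise on unexpected bytes
            check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


version = query_driver_version()
if version is None:
    print("Could not query the driver version with nvidia-smi.")
elif driver_branch(version) in AFFECTED_BRANCHES:
    print(f"Driver {version} is on a branch reported as affected on WSL2.")
else:
    print(f"Driver {version} is not on a branch reported as affected here.")
```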

Can we please integrate this fix into Ray? Thanks

device_name = device_name.decode("utf-8")

# Change to
try:
    device_name = device_name.decode('utf-16be')
except UnicodeDecodeError as e:
    device_name = device_name.decode("utf-8")

vladjohnson avatar Aug 05 '24 18:08 vladjohnson
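One caveat with trying `utf-16be` first, shown as a hedged sketch: almost any even-length byte string decodes as UTF-16 without raising, so a correct ASCII name from a healthy driver could be silently turned into unrelated characters, and the UTF-8 fallback would never run. `decode_with_utf16_first` is a hypothetical helper that mirrors the snippet above, and the device names are illustrative inputs.

```python
def decode_with_utf16_first(raw: bytes) -> str:
    # Mirrors the proposed change: try UTF-16 (big-endian), then fall back to UTF-8.
    try:
        return raw.decode("utf-16be")
    except UnicodeDecodeError:
        return raw.decode("utf-8")


even = b"Tesla T4"   # 8 bytes: each pair of ASCII bytes forms a valid UTF-16 code unit
odd = b"Tesla K80"   # 9 bytes: UTF-16 decoding fails on the dangling last byte

print(repr(decode_with_utf16_first(even)))  # four unrelated characters, no error
print(repr(decode_with_utf16_first(odd)))   # falls back to UTF-8
```

So the result would depend on whether the name happens to have an even or odd number of bytes, which is the kind of follow-on problem a looser decode can introduce.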

I think that only leads to problems later on. Maybe we should error out with a message about updating drivers?

mattip avatar Aug 05 '24 20:08 mattip

It would be helpful to include information about drivers to provide some context; Ray definitely should be more developer-friendly.

My concern with the drivers is that we are dependent on NVIDIA, and having a wider range of supported encodings is a good idea, unless it leads to major problems. Having tested Ray with the current driver version, everything seems stable.

Thanks

vladjohnson avatar Aug 07 '24 16:08 vladjohnson

Ray definitely should be more developer-friendly.

Ray sits near the bottom of a large stack of software. It tries to be as friendly as possible, but pulls in many pieces from third-party vendors: Python packages, OS components, user code. Not everything is under Ray's control.

and having a wider range of supported encodings

The problem was that a particular version of the NVIDIA drivers on WSL was buggy. The immediate victim was an API call to determine the device name, but simply changing the encoding would only allow the unsuspecting user to get to the next bug, which could have crashed the machine or led to wrong answers. I think failing fast is the best choice here; we don't know whether allowing unconventional encodings would have led to major problems or not. Perhaps the error message could be improved, but at the end of the day Ray depends on third-party software API calls working properly.

mattip avatar Aug 08 '24 04:08 mattip
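As a sketch of what "fail fast with a better message" could look like (this is not Ray's actual code, and `decode_device_name` is a hypothetical helper), the decode step could keep UTF-8 as the only accepted encoding and wrap the failure in an error that points at the driver:

```python
from typing import Union


def decode_device_name(raw: Union[bytes, str]) -> str:
    """Decode an NVML device name, failing fast with an actionable message."""
    if isinstance(raw, str):
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Keep the original exception chained so the byte offset is not lost.
        raise RuntimeError(
            "NVML returned a GPU name that is not valid UTF-8 "
            f"(first bytes: {raw[:8]!r}). In this thread that was seen with "
            "NVIDIA driver 555 on WSL2; users reported that rolling back to "
            "552.44 or updating to 560.70 or later resolved it."
        ) from exc


# Hypothetical inputs: a well-formed name, and bytes starting with 0xf8.
print(decode_device_name(b"Tesla T4"))
try:
    decode_device_name(b"\xf8\x00")
except RuntimeError as err:
    print(err)
```

This keeps the current behaviour of refusing a malformed name, and only changes what the user sees when it happens.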

Also just encountered this issue on WSL2 with NVIDIA driver version 555, and updating to the latest, 566.14, solved the issue for me. Thanks for the tip!

mwilby-dendra avatar Nov 20 '24 16:11 mwilby-dendra