dalle-flow
jina does not pass the right GPU into clipseg
Describe the bug
Does not work:
```yaml
- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
```
Works:
```yaml
- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "6"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
```
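For reference, a minimal sketch (illustrative, not part of the flow) that checks what each value exposes to torch when run in the executor's environment:
```python
# Set CUDA_VISIBLE_DEVICES *before* CUDA is initialised, then ask torch what
# it can see. With the numeric ID the device count is 1; with the UUID it
# comes back 0, which matches the "does not work" behaviour above.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be'  # or "6"

import torch
print(torch.cuda.device_count())
if torch.cuda.device_count() > 0:
    print(torch.cuda.get_device_name(0))  # confirms which physical card was selected
```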
Describe how you solve it
I use the numeric GPU ID (sad)
Environment
- jina 3.8.3
- docarray 0.16.2
- jcloud 0.0.35
- jina-hubble-sdk 0.18.0
- jina-proto 0.1.13
- protobuf 3.20.1
- proto-backend cpp
- grpcio 1.47.0
- pyyaml 6.0
- python 3.8.10
- platform Linux
- platform-release 5.15.0-52-generic
- platform-version #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
- architecture x86_64
- processor x86_64
- uid 2485377892357
- session-id fcbedcc8-5d43-11ed-9251-0242ac110005
- uptime 2022-11-05T19:56:49.977485
- ci-vendor (unset)
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)
Screenshots
N/A
Hey @mchaker,
What backend are you using? What does clipseg do? It seems that the DL backend does not understand the UUID.
Hey @mchaker,
Are you sure your CUDA version supports MIG access?
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#cuda-baremetal
This documentation lists the driver versions that support the feature, plus the syntax to be used.
Can you try changing your YAML to:
```yaml
- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
```
or
```yaml
- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
```
?
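To double-check the exact identifiers the driver exposes (GPU-... UUIDs, plus MIG-... UUIDs when MIG is enabled), here is a small sketch, assuming the pynvml package is installed:
```python
# Enumerate device names and UUIDs exactly as the NVIDIA driver reports them,
# so the CUDA_VISIBLE_DEVICES value can be matched against a real identifier.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        # older pynvml versions return bytes rather than str
        if isinstance(name, bytes):
            name, uuid = name.decode(), uuid.decode()
        print(f'{i}: {name} {uuid}')
finally:
    pynvml.nvmlShutdown()
```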
My NVIDIA driver version is 515, so it supports MIG.
However, I do not use MIG on my cards. I just use the main card UUID from nvidia-smi -L.
I'll try the MIG prefix and report back.
clipseg is an executor set up for Jina. I use the UUID GPU specification method with other executors, and Jina passes the right GPU to them. For some reason it does not pass the right GPU to the clipseg executor. :(
This is weird; do you have the source code of clipseg? Can you check what the value is in the Executor when you do:
os.environ['CUDA_VISIBLE_DEVICES']
?
What Jina does is simply set the env vars for each Executor process, so whether or not this is respected by the Executor is an Executor or upstream problem.
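As a minimal sketch of that mechanism (illustrative only, not Jina's actual code): the orchestrator sets the variable in the child process environment, and the child sees it verbatim; how the DL framework interprets it is out of Jina's hands.
```python
import os
import subprocess

# Mimic an orchestrator passing CUDA_VISIBLE_DEVICES to a worker process.
# The UUID is the one from this issue.
env = dict(os.environ)
env['CUDA_VISIBLE_DEVICES'] = 'GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be'

# The child prints exactly the value that was set; whether torch/JAX inside
# the child can resolve it to a device is a separate driver/framework question.
subprocess.run(
    ['python', '-c', "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env,
    check=True,
)
```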
I see - I will check the os.environ value and report back.
Hey @mchaker, any news about it?
@JoanFM yes, CUDA_VISIBLE_DEVICES is GPU-87d2c7e5-c3eb-1181-1857-368f4c2bbbbb in the container (the proper GPU ID).
However, Jina crashes with:
```
Waiting stablemulti clipseg upscalerp40 realesrgan... ━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━ 2/6 0:00:18
CRITICAL clipseg/rep-0@61 can not load the executor from executors/clipseg/config.yml [11/11/22 14:54:57]
ERROR clipseg/rep-0@61 RuntimeError('Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.') during <class 'jina.serve.runtimes.worker.WorkerRuntime'> initialization [11/11/22 14:54:57]
add "--quiet-error" to suppress the exception details
Traceback (most recent call last):
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/orchestrate/pods/__init__.py", line 74, in run
    runtime = runtime_cls(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py", line 36, in __init__
    super().__init__(args, **kwargs)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/asyncio.py", line 80, in __init__
    self._loop.run_until_complete(self.async_setup())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py", line 101, in async_setup
    self._data_request_handler = DataRequestHandler(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…", line 49, in __init__
    self._load_executor(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…", line 139, in _load_executor
    self._executor: BaseExecutor = BaseExecutor.load_config(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 760, in load_config
    obj = JAML.load(tag_yml, substitute=False, runtime_args=runtime_args)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 174, in load
    r = yaml.load(stream, Loader=get_jina_loader_with_runtime(runtime_args))
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 51, in get_single_data
    return self.construct_document(node)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 55, in construct_document
    data = self.construct_object(node)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 100, in construct_object
    data = constructor(self, node)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 582, in _from_yaml
    return get_parser(cls, version=data.get('version', None)).parse(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/parsers/executor/legacy.py", line 45, in parse
    obj = cls(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/executors/decorators.py", line 63, in arg_wrapper
    f = func(self, *args, **kwargs)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/helper.py", line 71, in arg_wrapper
    f = func(self, *args, **kwargs)
  File "/dalle/dalle-flow/executors/clipseg/executor.py", line 71, in __init__
    torch.load(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1131, in _load
    result = unpickler.load()
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1101, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1083, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1055, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 215, in default_restore_location
    result = fn(storage, location)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 182, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 173, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on CUDA device '
RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
Please use torch.load with map_location to map your storages to an existing device.
DEBUG clipseg/rep-0@61 process terminated
```
Hey @mchaker,
This problem is in the Executor and how it loads onto the GPU. Where are you getting it from? Maybe we can open an issue on that repo and fix it there?
I see - let me check with the developer and see where they are getting the executor from. Maybe it is custom.
I believe the issue may come from how the model was stored, or something like this. In this case, Jina has made sure that your CUDA_VISIBLE_DEVICES env var is passed correctly to the Executor.
I see -- I'll follow up with the executor authors and dig into the executor source. Thanks for your help!
@JoanFM actually it looks like the executor is from Jina: https://github.com/jina-ai/dalle-flow/blob/main/executors/clipseg/executor.py
The device for the model is simply mapped with:
```python
model.load_state_dict(
    torch.load(
        f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
        map_location=torch.device('cuda'),
    ),
    strict=False,
)
```
In this case it appears that torch is unable to map the location. @mchaker, before these lines in executors/clipseg/executor.py you can add print(os.environ.get('CUDA_VISIBLE_DEVICES')) to see what the environment actually is.
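For debugging, here is one possible defensive variant of the loading code above (a sketch, not a confirmed fix; cache_path, WEIGHT_FOLDER_NAME and model are the names from the executor snippet above):
```python
import os
import torch

# Print what the process actually received, and what torch can see.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('torch.cuda.device_count() =', torch.cuda.device_count())

# Choose map_location explicitly so an unrecognised device setting degrades
# to CPU instead of crashing inside torch.load.
map_location = torch.device('cuda') if torch.cuda.device_count() > 0 else torch.device('cpu')

model.load_state_dict(
    torch.load(
        f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
        map_location=map_location,
    ),
    strict=False,
)
```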
Hey @AmericanPresidentJimmyCarter, do you know why it cannot be loaded with that CUDA_VISIBLE_DEVICES setting?
@JoanFM No, I will try to get you debug output from the env. This appears to be a strange one.
I am transferring the issue to dalle-flow, because the issue is specific to the Executor in this project.
@AmericanPresidentJimmyCarter what do you need from the env?
Hey @mchaker, @AmericanPresidentJimmyCarter, any progress on this?
I still do not know why it happens -- it's only this one specific executor that has the problem. We can update to the latest jina and see if it persists.
I updated jina using pip install -U jina and the error still happens:
RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
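In the meantime, a possible workaround that keeps UUIDs in my own tooling is to resolve the UUID to a numeric index before handing it to CUDA_VISIBLE_DEVICES; a sketch assuming pynvml, with uuid_to_index being an illustrative helper name:
```python
# Resolve a GPU UUID (as printed by nvidia-smi -L) to its numeric index,
# since the numeric form is the one that currently works with clipseg.
import pynvml

def uuid_to_index(target_uuid: str) -> int:
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            uuid = pynvml.nvmlDeviceGetUUID(handle)
            if isinstance(uuid, bytes):  # older pynvml returns bytes
                uuid = uuid.decode()
            if uuid == target_uuid:
                return i
        raise ValueError(f'no GPU with UUID {target_uuid}')
    finally:
        pynvml.nvmlShutdown()

print(uuid_to_index('GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be'))
```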