dalle-flow
jina does not pass the right GPU into clipseg
Describe the bug
Does not work:
```yaml
- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
```
Works:
```yaml
- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "6"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
```
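For reference, a minimal sketch (illustrative, not part of the flow) that checks what each value exposes to torch when run in the executor's environment:
```python
# Set CUDA_VISIBLE_DEVICES *before* CUDA is initialised, then ask torch what
# it can see. With the numeric ID the device count is 1; with the UUID it
# comes back 0, which matches the "does not work" behaviour above.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be'  # or "6"

import torch
print(torch.cuda.device_count())
if torch.cuda.device_count() > 0:
    print(torch.cuda.get_device_name(0))  # confirms which physical card was selected
```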
Describe how you solve it
I use the numeric GPU ID (sad)
Environment
- jina 3.8.3
- docarray 0.16.2
- jcloud 0.0.35
- jina-hubble-sdk 0.18.0
- jina-proto 0.1.13
- protobuf 3.20.1
- proto-backend cpp
- grpcio 1.47.0
- pyyaml 6.0
- python 3.8.10
- platform Linux
- platform-release 5.15.0-52-generic
- platform-version #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
- architecture x86_64
- processor x86_64
- uid 2485377892357
- session-id fcbedcc8-5d43-11ed-9251-0242ac110005
- uptime 2022-11-05T19:56:49.977485
- ci-vendor (unset)
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)
Screenshots
N/A
Hey @mchaker,
What backend are you using? What does clipseg do? It seems that the DL backend does not understand the UUID.
Hey @mchaker,
Are you sure your CUDA version supports MIG access?
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#cuda-baremetal
This documentation lists the driver versions that support the feature, plus the syntax to be used.
Can you try changing your YAML to:
```yaml
- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
```
or
```yaml
- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
```
?
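To double-check the exact identifiers the driver exposes (GPU-... UUIDs, plus MIG-... UUIDs when MIG is enabled), here is a small sketch, assuming the pynvml package is installed:
```python
# Enumerate device names and UUIDs exactly as the NVIDIA driver reports them,
# so the CUDA_VISIBLE_DEVICES value can be matched against a real identifier.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        # older pynvml versions return bytes rather than str
        if isinstance(name, bytes):
            name, uuid = name.decode(), uuid.decode()
        print(f'{i}: {name} {uuid}')
finally:
    pynvml.nvmlShutdown()
```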
My NVIDIA driver version is 515, so it supports MIG.
However, I do not use MIG on my cards. I just use the main card UUID from nvidia-smi -L.
I'll try the MIG prefix and report back.
clipseg is an executor set up for Jina. I use the UUID GPU specification method with other executors, and Jina passes the right GPU to them. For some reason it does not pass the right GPU to the clipseg executor. :(
This is weird; do you have the source code of clipseg? Can you check what the value is in the Executor when you do:
os.environ['CUDA_VISIBLE_DEVICES']
?
What Jina does is simply set the env vars for each Executor process, so whether or not this is respected by the Executor is an Executor or upstream problem.
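As a minimal sketch of that mechanism (illustrative only, not Jina's actual code): the orchestrator sets the variable in the child process environment, and the child sees it verbatim; how the DL framework interprets it is out of Jina's hands.
```python
import os
import subprocess

# Mimic an orchestrator passing CUDA_VISIBLE_DEVICES to a worker process.
# The UUID is the one from this issue.
env = dict(os.environ)
env['CUDA_VISIBLE_DEVICES'] = 'GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be'

# The child prints exactly the value that was set; whether torch/JAX inside
# the child can resolve it to a device is a separate driver/framework question.
subprocess.run(
    ['python', '-c', "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env,
    check=True,
)
```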
I see - I will check the os.environ value and report back.
Hey @mchaker, any news about it?
@JoanFM yes, CUDA_VISIBLE_DEVICES is GPU-87d2c7e5-c3eb-1181-1857-368f4c2bbbbb in the container (the proper GPU ID).
However, Jina crashes with:
```
Waiting stablemulti clipseg upscalerp40 realesrgan... ━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━ 2/6 0:00:18
CRITICAL clipseg/rep-0@61 can not load the executor from executors/clipseg/config.yml [11/11/22 14:54:57]
ERROR clipseg/rep-0@61 RuntimeError('Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.') during <class 'jina.serve.runtimes.worker.WorkerRuntime'> initialization [11/11/22 14:54:57]
add "--quiet-error" to suppress the exception details
Traceback (most recent call last):
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/orchestrate/pods/__init__.py", line 74, in run
    runtime = runtime_cls(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py", line 36, in __init__
    super().__init__(args, **kwargs)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/asyncio.py", line 80, in __init__
    self._loop.run_until_complete(self.async_setup())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py", line 101, in async_setup
    self._data_request_handler = DataRequestHandler(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…", line 49, in __init__
    self._load_executor(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…", line 139, in _load_executor
    self._executor: BaseExecutor = BaseExecutor.load_config(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 760, in load_config
    obj = JAML.load(tag_yml, substitute=False, runtime_args=runtime_args)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 174, in load
    r = yaml.load(stream, Loader=get_jina_loader_with_runtime(runtime_args))
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 51, in get_single_data
    return self.construct_document(node)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 55, in construct_document
    data = self.construct_object(node)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 100, in construct_object
    data = constructor(self, node)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 582, in _from_yaml
    return get_parser(cls, version=data.get('version', None)).parse(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/parsers/executor/legacy.py", line 45, in parse
    obj = cls(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/executors/decorators.py", line 63, in arg_wrapper
    f = func(self, *args, **kwargs)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/helper.py", line 71, in arg_wrapper
    f = func(self, *args, **kwargs)
  File "/dalle/dalle-flow/executors/clipseg/executor.py", line 71, in __init__
    torch.load(
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1131, in _load
    result = unpickler.load()
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1101, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1083, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1055, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 215, in default_restore_location
    result = fn(storage, location)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 182, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 173, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on CUDA device '
RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
Please use torch.load with map_location to map your storages to an existing device.
DEBUG clipseg/rep-0@61 process terminated
```
Hey @mchaker,
This problem is in the Executor and how it loads onto the GPU. Where are you getting it from? Maybe we can open an issue on that repo and fix it there?
I see - let me check with the developer and see where they are getting the executor from. Maybe it is custom.
I believe the issue may come from how the model was stored, or something like this. In this case, Jina has made sure that your CUDA_VISIBLE_DEVICES env var is passed correctly to the Executor.
I see -- I'll follow up with the executor authors and dig into the executor source. Thanks for your help!
@JoanFM actually it looks like the executor is from Jina: https://github.com/jina-ai/dalle-flow/blob/main/executors/clipseg/executor.py
The device for the model is simply mapped with:
```python
model.load_state_dict(
    torch.load(
        f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
        map_location=torch.device('cuda'),
    ),
    strict=False,
)
```
In this case it appears that torch is unable to map the location. @mchaker, before these lines in executors/clipseg/executor.py you can add print(os.environ.get('CUDA_VISIBLE_DEVICES')) to see what the environment actually is.
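For debugging, here is one possible defensive variant of the loading code above (a sketch, not a confirmed fix; cache_path, WEIGHT_FOLDER_NAME and model are the names from the executor snippet above):
```python
import os
import torch

# Print what the process actually received, and what torch can see.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('torch.cuda.device_count() =', torch.cuda.device_count())

# Choose map_location explicitly so an unrecognised device setting degrades
# to CPU instead of crashing inside torch.load.
map_location = torch.device('cuda') if torch.cuda.device_count() > 0 else torch.device('cpu')

model.load_state_dict(
    torch.load(
        f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
        map_location=map_location,
    ),
    strict=False,
)
```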
Hey @AmericanPresidentJimmyCarter, do you know why it cannot be loaded with that CUDA_VISIBLE_DEVICES setting?
@JoanFM No, I will try to get you debug output from the env. This appears to be a strange one.
I am transferring the issue to dalle-flow, because the issue is specific to the Executor in this project.
@AmericanPresidentJimmyCarter what do you need from the env?
Hey @mchaker, @AmericanPresidentJimmyCarter, any progress on this?
I still do not know why it happens -- it's only this one specific executor that has the problem. We can update to the latest jina and see if it persists.
I updated jina using pip install -U jina and the error still happens:
RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
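In the meantime, a possible workaround that keeps UUIDs in my own tooling is to resolve the UUID to a numeric index before handing it to CUDA_VISIBLE_DEVICES; a sketch assuming pynvml, with uuid_to_index being an illustrative helper name:
```python
# Resolve a GPU UUID (as printed by nvidia-smi -L) to its numeric index,
# since the numeric form is the one that currently works with clipseg.
import pynvml

def uuid_to_index(target_uuid: str) -> int:
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            uuid = pynvml.nvmlDeviceGetUUID(handle)
            if isinstance(uuid, bytes):  # older pynvml returns bytes
                uuid = uuid.decode()
            if uuid == target_uuid:
                return i
        raise ValueError(f'no GPU with UUID {target_uuid}')
    finally:
        pynvml.nvmlShutdown()

print(uuid_to_index('GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be'))
```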