fix: Session creation failure due to wrong type check when using mock-accelerator

Open jopemachine opened this issue 8 months ago • 2 comments

Since DeviceId is a str type, mother_uuid should be defined as t.String.

https://github.com/lablup/backend.ai/blob/14996f2c8ea13301a6d57b1a7baa2e7dc7512b93/src/ai/backend/accelerator/mock/plugin.py#L104-L118

Currently, because mother_uuid is defined as tx.UUID, the following bug occurs when trying to create a session.

This PR prevents the following type of bug.

Client side

❯ ./backend.ai session create \
            -r cpu=1 -r mem=2g -r cuda.shares=14.2 \
            cr.backend.ai/testing/ngc-pytorch:23.10-pytorch2.1-py310-cuda12.2
✗ Session ID a8bdc7c6-64c4-473c-a80a-3c69b774fcfa has an error during scheduling/startup or cancelled.

Manager

2024-06-13 06:22:06.971 ERROR ai.backend.agent.server [52916] unexpected error
Traceback (most recent call last):
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 167, in _inner
    return await meth(
           ^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 143, in _inner
    return await meth(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 536, in create_kernels
    raise errors[0]
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/agent.py", line 1763, in create_kernel
    allocate(
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/resources.py", line 606, in allocate
    resource_spec.allocations[dev_name] = computer_ctx.alloc_map.allocate(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 478, in allocate
    calculated_alloc_map = self._allocate_impl[self.allocation_strategy](
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 645, in _allocate_evenly
    sorted_dev_allocs = self.get_current_allocations(affinity_hint, slot_name)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 129, in get_current_allocations
    return sorted(
           ^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 131, in <lambda>
    key=lambda pair: self.device_slots[pair[0]].amount - pair[1],
                     ~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'c59395cd-ac91-4cd3-a1b0-3d2568aa2d03'
2024-06-13 06:22:06.972 ERROR callosum.rpc.channel.Peer [52916] RPC user error
Traceback (most recent call last):
  File "/home/jopemachine/backend.ai/dist/export/python/virtualenvs/python-default/3.12.2/lib/python3.12/site-packages/callosum/rpc/channel.py", line 292, in _func_task
    result = await self._func_scheduler.get_fut(server_request_id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/dist/export/python/virtualenvs/python-default/3.12.2/lib/python3.12/site-packages/callosum/ordering.py", line 214, in get_fut
    return await task
           ^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 167, in _inner
    return await meth(
           ^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 143, in _inner
    return await meth(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 536, in create_kernels
    raise errors[0]
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/agent.py", line 1763, in create_kernel
    allocate(
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/resources.py", line 606, in allocate
    resource_spec.allocations[dev_name] = computer_ctx.alloc_map.allocate(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 478, in allocate
    calculated_alloc_map = self._allocate_impl[self.allocation_strategy](
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 645, in _allocate_evenly
    sorted_dev_allocs = self.get_current_allocations(affinity_hint, slot_name)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 129, in get_current_allocations
    return sorted(
           ^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 131, in <lambda>
    key=lambda pair: self.device_slots[pair[0]].amount - pair[1],
                     ~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'c59395cd-ac91-4cd3-a1b0-3d2568aa2d03'
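The KeyError above is consistent with a key-type mismatch: a mapping keyed by one type is looked up with a textually identical key of another type. The sketch below reproduces that general failure mode with the standard library only. It is an assumption about the mechanism, not the actual Backend.AI code path; `slots_keyed_by_uuid`, `slots_keyed_by_str`, and `describe_lookup` are hypothetical names used for illustration.

```python
import uuid

# Device ID taken from the KeyError in the traceback.
RAW_ID = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d03"

# Hypothetical stand-ins for a slot table: one keyed by uuid.UUID objects
# (what a UUID-coercing validator would produce), one keyed by plain str
# (matching a str-typed device ID).
slots_keyed_by_uuid = {uuid.UUID(RAW_ID): "slot"}
slots_keyed_by_str = {RAW_ID: "slot"}


def describe_lookup(table, key):
    """Look up key in table and report either the value or the KeyError."""
    try:
        return f"found {table[key]}"
    except KeyError as exc:
        return f"KeyError: {exc}"


# A str key never matches a uuid.UUID key, even when the text is identical.
print(describe_lookup(slots_keyed_by_uuid, RAW_ID))
# With both sides typed as str, the lookup succeeds.
print(describe_lookup(slots_keyed_by_str, RAW_ID))
```

Keeping the validated value as a string, as this PR proposes with t.String, avoids the mismatch because both the stored key and the lookup key stay the same type.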

Checklist: (if applicable)

  • [ ] Milestone metadata specifying the target backport version
  • [ ] Mention to the original issue
  • [ ] Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • [ ] Update of end-to-end CLI integration tests in ai.backend.test
  • [ ] API server-client counterparts (e.g., manager API -> client SDK)
  • [ ] Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • [ ] Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

jopemachine · Jun 12 '24 05:06