fix: Session creation failure due to wrong type check when using mock-accelerator
Since `DeviceId` is a `str` type, `mother_uuid` should be defined as `t.String`.
https://github.com/lablup/backend.ai/blob/14996f2c8ea13301a6d57b1a7baa2e7dc7512b93/src/ai/backend/accelerator/mock/plugin.py#L104-L118
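To illustrate why the validator type matters, here is a minimal standard-library sketch. It assumes that `tx.UUID` converts its input into a `uuid.UUID` object, and it models `DeviceId` as a `NewType` over `str`. The helpers `as_string` and `as_uuid` are hypothetical stand-ins for the two validators, not the real trafaret API.

```python
import uuid
from typing import NewType

# Models the statement above that DeviceId is a str type (assumed NewType).
DeviceId = NewType("DeviceId", str)

# Device ID as it appears in the KeyError reported in this PR.
raw = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d03"


def as_string(value: str) -> str:
    """Hypothetical stand-in for a string check: the value stays a str."""
    if not isinstance(value, str):
        raise TypeError("expected str")
    return value


def as_uuid(value: str) -> uuid.UUID:
    """Hypothetical stand-in for a UUID check that converts to uuid.UUID."""
    return uuid.UUID(value)


device_id = DeviceId(raw)

kept_as_str = as_string(raw)
converted = as_uuid(raw)

# A str value compares equal to the DeviceId; a uuid.UUID object does not,
# even though both print the same text.
same_when_str = kept_as_str == device_id
same_when_uuid = converted == device_id
```

In this sketch `same_when_str` is true and `same_when_uuid` is false, which is the kind of mismatch a string-typed `mother_uuid` avoids.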
Currently, `mother_uuid` is defined as `tx.UUID`, which causes session creation to fail with the error shown below.
This PR corrects the type so that this failure no longer occurs.
Client side
❯ ./backend.ai session create \
-r cpu=1 -r mem=2g -r cuda.shares=14.2 \
cr.backend.ai/testing/ngc-pytorch:23.10-pytorch2.1-py310-cuda12.2
✗ Session ID a8bdc7c6-64c4-473c-a80a-3c69b774fcfa has an error during scheduling/startup or cancelled.
Manager
2024-06-13 06:22:06.971 ERROR ai.backend.agent.server [52916] unexpected error
Traceback (most recent call last):
File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 167, in _inner
return await meth(
^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 143, in _inner
return await meth(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 536, in create_kernels
raise errors[0]
File "/home/jopemachine/backend.ai/src/ai/backend/agent/agent.py", line 1763, in create_kernel
allocate(
File "/home/jopemachine/backend.ai/src/ai/backend/agent/resources.py", line 606, in allocate
resource_spec.allocations[dev_name] = computer_ctx.alloc_map.allocate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 478, in allocate
calculated_alloc_map = self._allocate_impl[self.allocation_strategy](
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 645, in _allocate_evenly
sorted_dev_allocs = self.get_current_allocations(affinity_hint, slot_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 129, in get_current_allocations
return sorted(
^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 131, in <lambda>
key=lambda pair: self.device_slots[pair[0]].amount - pair[1],
~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'c59395cd-ac91-4cd3-a1b0-3d2568aa2d03'
2024-06-13 06:22:06.972 ERROR callosum.rpc.channel.Peer [52916] RPC user error
Traceback (most recent call last):
File "/home/jopemachine/backend.ai/dist/export/python/virtualenvs/python-default/3.12.2/lib/python3.12/site-packages/callosum/rpc/channel.py", line 292, in _func_task
result = await self._func_scheduler.get_fut(server_request_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/dist/export/python/virtualenvs/python-default/3.12.2/lib/python3.12/site-packages/callosum/ordering.py", line 214, in get_fut
return await task
^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 167, in _inner
return await meth(
^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 143, in _inner
return await meth(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 536, in create_kernels
raise errors[0]
File "/home/jopemachine/backend.ai/src/ai/backend/agent/agent.py", line 1763, in create_kernel
allocate(
File "/home/jopemachine/backend.ai/src/ai/backend/agent/resources.py", line 606, in allocate
resource_spec.allocations[dev_name] = computer_ctx.alloc_map.allocate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 478, in allocate
calculated_alloc_map = self._allocate_impl[self.allocation_strategy](
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 645, in _allocate_evenly
sorted_dev_allocs = self.get_current_allocations(affinity_hint, slot_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 129, in get_current_allocations
return sorted(
^^^^^^^
File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 131, in <lambda>
key=lambda pair: self.device_slots[pair[0]].amount - pair[1],
~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'c59395cd-ac91-4cd3-a1b0-3d2568aa2d03'
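The `KeyError` above can be reproduced in isolation. The sketch below is not the agent's actual code. It assumes the slot table ends up keyed by `uuid.UUID` objects while allocations are keyed by plain strings, and `DeviceSlot`, `sort_by_free_amount`, and the slot amount of 16 are hypothetical names and values chosen for illustration.

```python
import uuid
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class DeviceSlot:
    """Hypothetical, simplified stand-in for a device slot record."""
    amount: Decimal


raw_id = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d03"

# Assumption: slots are keyed by uuid.UUID, allocations by plain str IDs.
device_slots_uuid_keys = {uuid.UUID(raw_id): DeviceSlot(Decimal("16"))}
allocations = {raw_id: Decimal("14.2")}


def sort_by_free_amount(device_slots, allocs):
    # Same shape as the lambda in the traceback:
    # remaining capacity = slot amount minus allocated amount.
    return sorted(
        allocs.items(),
        key=lambda pair: device_slots[pair[0]].amount - pair[1],
    )


# With UUID-typed keys, looking up the str device ID raises KeyError.
try:
    sort_by_free_amount(device_slots_uuid_keys, allocations)
    failed_key = None
except KeyError as exc:
    failed_key = exc.args[0]

# With str-typed keys on both sides, the lookup succeeds.
device_slots_str_keys = {raw_id: DeviceSlot(Decimal("16"))}
result = sort_by_free_amount(device_slots_str_keys, allocations)
```

Here `failed_key` holds the string device ID, matching the form of the error in the log, while the string-keyed variant sorts without error.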
Checklist: (if applicable)
- [ ] Milestone metadata specifying the target backport version
- [ ] Mention to the original issue
- [ ] Installer updates including:
  - Fixtures for db schema changes
  - New mandatory config options
- [ ] Update of end-to-end CLI integration tests in `ai.backend.test`
- [ ] API server-client counterparts (e.g., manager API -> client SDK)
- [ ] Test case(s) to:
  - Demonstrate the difference of before/after
  - Demonstrate the flow of abstract/conceptual models with a concrete implementation
- [ ] Documentation
  - Contents in the `docs` directory
  - Docstrings in public interfaces and type annotations
- Contents in the