Add service user to Dockerfile
This PR introduces several enhancements to the primary Dockerfile, including:
- Added a user to `/etc/passwd` to run aphrodite-engine
- Separated the user home from the application home, so all dotfiles are written to `/home/aphrodite-engine` and permission issues are avoided
- Moved all build arguments to the top of the file for readability
- Added the ability to set the UID and GID of the service account at build time, useful for reading host-mounted volumes
- Added the ability to set the aphrodite-engine branch at build time
- Added `apt-get clean` to (slightly) reduce layer size
- Added a build argument for `TORCH_CUDA_ARCH_LIST`, and added sm_87 to the default list (Jetson, DRIVE, Clara)
- Parameterized the `ENTRYPOINT` with `APP_HOME`
- Formatting changes for consistency & readability
No application paths were changed as part of this enhancement, to maintain backwards compatibility; however, please be aware that `/aphrodite-engine` is now read-only. A sketch of the overall pattern is shown below.
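To make the shape of these changes concrete, here is a minimal sketch of the pattern, not the PR's exact Dockerfile. The base image, package steps, clone URL layout, entrypoint path, and all build-arg names other than `APP_HOME` and `TORCH_CUDA_ARCH_LIST` are illustrative assumptions:

```dockerfile
# Minimal sketch of the pattern; not the PR's exact file.
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

# All build arguments grouped at the top for readability
ARG UID=1000
ARG GID=1000
ARG APHRODITE_BRANCH=main
ARG APP_HOME=/aphrodite-engine
# Illustrative default list including 8.7 (sm_87: Jetson, DRIVE, Clara)
ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
ENV TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}

# Real /etc/passwd and /etc/group entries for the service account, with a
# home directory separate from the application directory so dotfiles
# (e.g. ~/.cache) land under /home/aphrodite-engine
RUN groupadd --gid "${GID}" aphrodite-engine \
 && useradd --uid "${UID}" --gid "${GID}" --create-home \
            --home-dir /home/aphrodite-engine aphrodite-engine

# apt-get clean (and removing the lists) slightly reduces the layer size
RUN apt-get update \
 && apt-get install -y --no-install-recommends git \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*

# Clone the branch selected at build time into the application home;
# left owned by root, so the app directory is read-only for the service user
RUN git clone --branch "${APHRODITE_BRANCH}" --single-branch \
    https://github.com/PygmalionAI/aphrodite-engine.git "${APP_HOME}"

# Promote APP_HOME to the runtime environment so ENTRYPOINT can use it
ENV APP_HOME=${APP_HOME}
ENV HOME=/home/aphrodite-engine

USER aphrodite-engine
WORKDIR ${APP_HOME}
# Shell form: ${APP_HOME} is resolved at container start
# (entrypoint path is a hypothetical example)
ENTRYPOINT ${APP_HOME}/docker/entrypoint.sh
```

Note that `APP_HOME` is promoted from `ARG` to `ENV` because exec-form `ENTRYPOINT` does not expand variables; the shell form above resolves `${APP_HOME}` when the container starts.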
Motivation
When loading a GPTQ-quantized version of CommandR+, aphrodite-engine exited after building the KV cache with the message `KeyError: 'getpwuid(): uid not found: 1000'`. The root cause was ultimately running the process under a UID that had no corresponding entry in `/etc/passwd`. I'm not sure why this wasn't happening with other models I've tested; judging by the traceback, the lookup is only reached when torch inductor builds its per-user cache directory via `getpass.getuser()`.
```
(RayWorkerAphrodite pid=3988) INFO: Model weights loaded. Memory usage: 13.97 GiB x 4 = 55.88 GiB [repeated 2x across cluster]
...
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 63, in codegen_nodes
[rank0]: return self._triton_scheduling.codegen_nodes(nodes)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3255, in codegen_nodes
[rank0]: return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3427, in codegen_node_schedule
[rank0]: kernel_name = self.define_kernel(src_code, node_schedule)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3537, in define_kernel
[rank0]: basename, _, kernel_path = get_path(code_hash(src_code.strip()), "py")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py", line 349, in get_path
[rank0]: subdir = os.path.join(cache_dir(), basename[1:3])
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/utils.py", line 739, in cache_dir
[rank0]: sanitized_username = re.sub(r'[\\/:*?"<>|]', "_", getpass.getuser())
[rank0]: File "/usr/lib/python3.10/getpass.py", line 169, in getuser
[rank0]: return pwd.getpwuid(os.getuid())[0]
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank0]: KeyError: 'getpwuid(): uid not found: 1000'
[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
[rank0]: You can suppress this exception and fall back to eager by setting:
[rank0]: import torch._dynamo
[rank0]: torch._dynamo.config.suppress_errors = True
```
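The failure is easy to reproduce outside aphrodite-engine: `getpass.getuser()` first checks the `LOGNAME`, `USER`, `LNAME`, and `USERNAME` environment variables, then falls back to `pwd.getpwuid(os.getuid())`, which raises `KeyError` for a UID with no passwd entry. A minimal repro sketch, assuming a stock `python:3.10-slim` image (chosen just for illustration):

```sh
# Run as a numeric UID that has no /etc/passwd entry in the image;
# getpass.getuser() falls through to pwd.getpwuid() and raises
docker run --rm --user 1000:1000 python:3.10-slim \
    python3 -c 'import getpass; print(getpass.getuser())'
# KeyError: 'getpwuid(): uid not found: 1000'
```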
As stated above, this introduces a better design in which the service account user's home directory is separate from the application directory. This allows any dotfiles (such as `.cache`) created by that user to reside outside the main application directory.
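It also means host-mounted volumes can be read without permission juggling when the image is built with the host user's UID/GID. A hypothetical usage, assuming the build-arg names from the sketch above and an illustrative mount path:

```sh
# Bake the host user's UID/GID into the service account at build time
docker build \
  --build-arg UID=$(id -u) \
  --build-arg GID=$(id -g) \
  -t aphrodite-engine:local .

# The container user now matches the owner of the mounted directory
docker run --rm --gpus all \
  -v "$HOME/models:/home/aphrodite-engine/models" \
  aphrodite-engine:local
```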