
Add service user to Dockerfile

Open · theobjectivedad opened this issue 8 months ago · 1 comment

This PR introduces several enhancements to the primary Dockerfile including:

  • Added a user to /etc/passwd to run aphrodite-engine
  • Separated the user home from the application home, so that all dotfiles are written to /home/aphrodite-engine and permission issues are avoided
  • Moved all build arguments to the top of the file for readability
  • Added the ability to set the UID and GID of the service account at build time, useful for reading host-mounted volumes
  • Added the ability to set the aphrodite-engine branch at build time
  • Added apt-get clean to (slightly) reduce layer size
  • Added build argument for TORCH_CUDA_ARCH_LIST, added sm_87 to the default list (Jetson, DRIVE, Clara)
  • Parameterized ENTRYPOINT with APP_HOME
  • Formatting changes for consistency & readability
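
A minimal sketch of the Dockerfile shape these bullets describe (the base image, package list, repository URL, entrypoint script path, and default arch list are illustrative assumptions, not the PR's exact contents):

```dockerfile
# Sketch only; concrete values are assumptions.
ARG APP_HOME=/aphrodite-engine

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# All build arguments grouped at the top for readability; UID/GID are
# settable at build time so the service account can read host-mounted volumes.
ARG APP_HOME
ARG APP_UID=1000
ARG APP_GID=1000
ARG APP_BRANCH=main
# "8.7" (sm_87) covers Jetson Orin / DRIVE / Clara class devices.
ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
ENV TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST} \
    APP_HOME=${APP_HOME}

# Running apt-get clean in the same RUN keeps the package cache out of
# the layer, slightly reducing image size.
RUN apt-get update \
 && apt-get install -y --no-install-recommends git python3 python3-pip \
 && apt-get clean && rm -rf /var/lib/apt/lists/*

# Branch selectable at build time.
RUN git clone --branch "${APP_BRANCH}" \
      https://github.com/PygmalionAI/aphrodite-engine.git "${APP_HOME}"

# Shell-form wrapper so the entrypoint path follows APP_HOME at run time
# (entrypoint.sh is a hypothetical script name).
ENTRYPOINT ["/bin/sh", "-c", "exec ${APP_HOME}/docker/entrypoint.sh \"$@\"", "--"]
```

Declaring `ARG APP_HOME` before the first `FROM` and again inside the stage is required for the value to be visible both in the `FROM` line's scope and in the build stage itself.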

No application paths were changed as part of this enhancement, in order to maintain backwards compatibility; however, please be aware that /aphrodite-engine is now read-only.

Motivation

When loading a GPTQ-quantized version of CommandR+, aphrodite-engine exited after building the KV cache with the message "KeyError: 'getpwuid(): uid not found: 1000'". The root cause was ultimately that the process ran under a UID with no corresponding entry in /etc/passwd. I'm not sure why this didn't happen with the other models I've tested.

(RayWorkerAphrodite pid=3988) INFO:     Model weights loaded. Memory usage: 13.97 GiB x 4 = 55.88 GiB [repeated 2x across cluster]
...
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 63, in codegen_nodes
[rank0]:     return self._triton_scheduling.codegen_nodes(nodes)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3255, in codegen_nodes
[rank0]:     return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3427, in codegen_node_schedule
[rank0]:     kernel_name = self.define_kernel(src_code, node_schedule)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3537, in define_kernel
[rank0]:     basename, _, kernel_path = get_path(code_hash(src_code.strip()), "py")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py", line 349, in get_path
[rank0]:     subdir = os.path.join(cache_dir(), basename[1:3])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/utils.py", line 739, in cache_dir
[rank0]:     sanitized_username = re.sub(r'[\\/:*?"<>|]', "_", getpass.getuser())
[rank0]:   File "/usr/lib/python3.10/getpass.py", line 169, in getuser
[rank0]:     return pwd.getpwuid(os.getuid())[0]
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank0]: KeyError: 'getpwuid(): uid not found: 1000'
[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
[rank0]: You can suppress this exception and fall back to eager by setting:
[rank0]:     import torch._dynamo
[rank0]:     torch._dynamo.config.suppress_errors = True

As stated above, this introduces a better design in which the service account's home directory is separate from the application directory. This allows any dotfiles (such as .cache) created by that user to reside outside the main application directory.
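
One way this home/application split could look in the Dockerfile (UID/GID defaults, the account name, and paths are illustrative, not necessarily the PR's exact values):

```dockerfile
ARG APP_UID=1000
ARG APP_GID=1000

# Create real group and passwd entries so pwd.getpwuid() succeeds for the
# service UID, with a home directory separate from the application checkout.
RUN groupadd --gid "${APP_GID}" aphrodite-engine \
 && useradd --uid "${APP_UID}" --gid "${APP_GID}" \
      --create-home --home-dir /home/aphrodite-engine \
      --shell /usr/sbin/nologin aphrodite-engine

# Dotfiles (.cache, .config, ...) land under HOME, not the read-only app dir.
ENV HOME=/home/aphrodite-engine
USER aphrodite-engine
```

With the passwd entry in place, `getpass.getuser()` no longer falls through to a failing `pwd.getpwuid(os.getuid())` lookup, which is exactly where the traceback above originates.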

theobjectivedad · Jun 18 '24