Minor fixes for CodeGen Xeon and Gaudi Kubernetes codegen.yaml and doc updates

Open dmsuehir opened this issue 1 year ago • 0 comments

Description

This PR includes a few updates based on issues that I ran into when deploying the CodeGen example on a cluster with Xeon and Gaudi nodes. It addresses the following:

  • I added a note that a persistent volume claim can be used instead of creating the /mnt/opea-models directory on the nodes.
  • Deploying the codegen.yaml files gave an error like:
    error: error validating "codegen.yaml": error validating data: [unknown object type "nil" in ConfigMap.data.http_proxy, unknown object type "nil" in ConfigMap.data.https_proxy, unknown object type "nil" in ConfigMap.data.no_proxy]; if you choose to ignore these errors, turn validation off with --validate=false
    
    This error occurs because the ConfigMap in the yaml has a few env vars with no value, which are parsed as nil. Setting them to empty quotes ("") fixes the issue. [EDIT: this was resolved in PR 630]
  • I added a note that the service takes a couple of minutes to start, along with how to check the logs. I ran into a curl failure ("curl: (18) transfer closed with outstanding read data remaining") that happened only because the service wasn't ready yet. Knowing how to check the logs is also useful for watching the startup status and for telling whether a curl failure is caused by an actual error.
  • Running on Gaudi wasn't working for me ("RuntimeError: synStatus=26 [Generic failure] Device acquire failed.") until I added hugepages-2Mi and memory to the resource limits. The Habana documentation for Kubernetes shows hugepages-2Mi and memory in the resources, so that seems to be the recommended config.
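
For reference, the ConfigMap and Gaudi resource changes described above could look roughly like the sketch below. This is not copied from codegen.yaml: the object names, the image, and the hugepages/memory sizes are placeholders, and the habana.ai/gaudi resource name is an assumption about what the Habana device plugin exposes on the cluster. Size the values for your own nodes and workload.

```yaml
# Sketch only: names, image, and sizes are placeholders, not values from codegen.yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-proxy-config        # hypothetical name
data:
  # A bare `http_proxy:` with no value parses as null and fails validation;
  # an explicit empty string is accepted.
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""
---
apiVersion: v1
kind: Pod
metadata:
  name: example-gaudi-pod           # hypothetical name
spec:
  containers:
    - name: example-container       # hypothetical name
      image: example.invalid/placeholder:latest   # placeholder image
      resources:
        limits:
          habana.ai/gaudi: 1        # assumed device plugin resource name
          hugepages-2Mi: 4400Mi     # placeholder size
          memory: 64Gi              # placeholder size
        # No explicit requests: Kubernetes defaults requests to the limits when
        # they are omitted, which also satisfies the rule that hugepages
        # requests must equal hugepages limits.
```

The nodes must already have 2Mi huge pages pre-allocated for the hugepages-2Mi limit to be schedulable.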

Issues

N/A

Type of change

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [x] Others (enhancement, documentation, validation, etc.)

Dependencies

N/A

Tests

Manually tested the changes on a Kubernetes cluster with Xeon and Gaudi 2 nodes.

dmsuehir, Aug 16 '24 16:08