Task 4: BaseModel Controller PVC Handling

Open slin1237 opened this issue 5 months ago • 4 comments

Overview

Update the BaseModel controller to handle the complete PVC storage flow: validation, job creation, and status management. This replaces the original design, in which the model agent was involved.

Scope

  • Add PVC validation logic to controller reconciliation
  • Create metadata extraction jobs for PVC models
  • Monitor job status and update BaseModel accordingly
  • Handle all PVC-related errors and edge cases

Files to Modify

  • pkg/controller/v1beta1/basemodel/controller.go - Main implementation
  • config/rbac/role.yaml - Add PVC and Job permissions
  • pkg/controller/v1beta1/basemodel/controller_test.go - Tests

Implementation Details

Key Functions to Add

  1. PVC Detection in Reconcile Loop

    • Check if BaseModel uses PVC storage URI
    • Route to PVC-specific reconciliation logic
    • Maintain existing behavior for non-PVC models
  2. PVC Validation

    • Parse PVC URI to extract components
    • Verify PVC exists in the same namespace
    • Check PVC binding status
    • Update BaseModel status with validation results
  3. Metadata Extraction Job Management

    • Create Kubernetes Job for metadata extraction
    • Job should mount PVC read-only
    • Monitor job status (running, succeeded, failed)
    • Handle job lifecycle (creation, monitoring, cleanup)
  4. Status Updates

    • Update BaseModel status based on PVC and job states
    • Store PVC information in status annotations
    • Clear node-related status fields (not applicable for PVC)
    • Handle error messages and failure reasons
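The PVC detection and URI parsing steps above can be sketched as follows. This is a minimal sketch, not the controller's actual code: the URI shape `pvc://[namespace:]claim-name/sub/path` is inferred from the examples posted in this thread, and `PVCRef`, `IsPVCURI`, and `ParsePVCURI` are hypothetical names. The real parser belongs to Task 1 (Storage URI Parsing) and may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PVCRef holds the components of a pvc:// storage URI.
// Hypothetical type for illustration; not taken from the OME codebase.
type PVCRef struct {
	Namespace string // empty when the URI has no "namespace:" prefix
	Name      string // PersistentVolumeClaim name
	SubPath   string // path inside the volume, may be empty
}

const pvcScheme = "pvc://"

// IsPVCURI reports whether a storage URI should be routed to the
// PVC-specific reconciliation path. Non-PVC URIs keep existing behavior.
func IsPVCURI(uri string) bool {
	return strings.HasPrefix(uri, pvcScheme)
}

// ParsePVCURI splits "pvc://[namespace:]name/sub/path" into its parts.
// The format is an assumption based on the URIs shown in this issue.
func ParsePVCURI(uri string) (PVCRef, error) {
	if !IsPVCURI(uri) {
		return PVCRef{}, errors.New("not a pvc:// URI")
	}
	rest := strings.TrimPrefix(uri, pvcScheme)
	claim, subPath, _ := strings.Cut(rest, "/")
	ref := PVCRef{Name: claim, SubPath: strings.Trim(subPath, "/")}
	if ns, name, ok := strings.Cut(claim, ":"); ok {
		if ns == "" {
			return PVCRef{}, fmt.Errorf("invalid PVC URI %q: empty namespace", uri)
		}
		ref.Namespace, ref.Name = ns, name
	}
	if ref.Name == "" {
		return PVCRef{}, fmt.Errorf("invalid PVC URI %q: empty claim name", uri)
	}
	return ref, nil
}

func main() {
	ref, err := ParsePVCURI("pvc://default:models-pvc/Llama-3.2-1B-Instruct")
	fmt.Println(ref, err)
}
```

Keeping detection (`IsPVCURI`) separate from parsing lets the reconcile loop branch cheaply and leave non-PVC models on the existing path, while parse errors can be surfaced in the BaseModel status as an invalid-URI failure.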

Job Specification Requirements

The metadata extraction job should:

  • Use the ome-agent image with the model-metadata command
  • Mount the PVC at a specific path with subpath support
  • Run with appropriate resource limits
  • Have a reasonable timeout (e.g., 5 minutes)
  • Include TTL for automatic cleanup
  • Use a dedicated ServiceAccount with minimal permissions
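To make these requirements concrete, here is a hedged sketch that builds a `batch/v1` Job manifest as a plain Go map and prints it as JSON. It uses only the standard library so it runs on its own; a real controller would more likely use the typed Kubernetes API structs. The Job name suffix, image tag, mount path, ServiceAccount name, resource limits, and TTL value are all placeholders. Only the `model-metadata` command and the 5-minute timeout come from the list above, and any flags the command needs are omitted because the issue does not specify them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// obj is shorthand for a generic JSON object.
type obj = map[string]any

// BuildMetadataJob returns a batch/v1 Job manifest for metadata extraction.
// Hypothetical helper; every literal marked "placeholder" is an assumption.
func BuildMetadataJob(modelName, namespace, claimName, subPath string) obj {
	const mountPath = "/model" // placeholder mount point
	return obj{
		"apiVersion": "batch/v1",
		"kind":       "Job",
		"metadata": obj{
			"name":      modelName + "-metadata", // placeholder naming scheme
			"namespace": namespace,
		},
		"spec": obj{
			"activeDeadlineSeconds":   300, // 5-minute timeout from the requirements
			"ttlSecondsAfterFinished": 600, // placeholder TTL for automatic cleanup
			"backoffLimit":            0,   // do not retry a failed extraction
			"template": obj{
				"spec": obj{
					"serviceAccountName": "ome-metadata-extractor", // placeholder
					"restartPolicy":      "Never",
					"containers": []obj{{
						"name":  "metadata",
						"image": "ome-agent:latest", // placeholder image reference
						"args":  []string{"model-metadata"},
						"resources": obj{ // placeholder limits
							"limits": obj{"cpu": "500m", "memory": "512Mi"},
						},
						"volumeMounts": []obj{{
							"name":      "model",
							"mountPath": mountPath,
							"subPath":   subPath,
							"readOnly":  true,
						}},
					}},
					"volumes": []obj{{
						"name": "model",
						"persistentVolumeClaim": obj{
							"claimName": claimName,
							"readOnly":  true,
						},
					}},
				},
			},
		},
	}
}

func main() {
	job := BuildMetadataJob("llama-3-2-1b-instruct", "default", "models-pvc", "Llama-3.2-1B-Instruct")
	out, err := json.MarshalIndent(job, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Setting `readOnly` on both the volume and the mount keeps the extraction job from modifying model files, and `activeDeadlineSeconds` together with `ttlSecondsAfterFinished` covers the timeout and cleanup requirements without extra controller logic.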

Controller Watch Configuration

Update the controller to:

  • Watch owned Jobs for status updates
  • Filter ConfigMap watches to exclude PVC models
  • Add appropriate event handlers

Test Cases

  1. Valid PVC Flow:

    • PVC exists and is bound
    • Job created successfully
    • Metadata extracted and updated
    • Status becomes Ready
  2. PVC Validation:

    • PVC not found
    • PVC not bound
    • Invalid PVC URI
  3. Job Management:

    • Job creation
    • Job success handling
    • Job failure handling
    • Job already exists
  4. Edge Cases:

    • Metadata already populated
    • Job timeout
    • Controller restart during job

Acceptance Criteria

  • [ ] Controller validates PVC existence and binding
  • [ ] Creates metadata extraction job with correct spec
  • [ ] Monitors job status and updates BaseModel accordingly
  • [ ] Handles all error cases gracefully
  • [ ] Status reflects PVC storage accurately
  • [ ] No interference with non-PVC models
  • [ ] Comprehensive test coverage

Dependencies

  • Task 1: Storage URI Parsing
  • Task 3: Model Metadata Agent

Estimated Effort

5-6 hours (increased due to complete flow ownership)

slin1237, Jul 11 '25 19:07

/assign

slin1237, Jul 14 '25 10:07

Hi, I'm interested in contributing to this project! Could I work on this issue?

kris-gaudel, Aug 01 '25 05:08

Hello, I'm currently using the following YAML file to create a ClusterBaseModel, but it stays in the In_Transit state. Is automatic analysis of model-related information not supported yet?

apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-2-1b-instruct
spec:
  displayName: meta.llama-3.2-1b-instruct
  vendor: meta
  disabled: false
  version: "1.0.0"
  storage:
    storageUri: pvc://default:models-pvc/Llama-3.2-1B-Instruct
    path: /raid/models/meta/llama-3-2-1b-instruct

pv,pvc.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: models-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: models-sc
  nfs:
    server: 172.16.10.101
    path: /nfs/models
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
spec:
  volumeName: models-pv
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: models-sc

nfs dir

ls /nfs/models/Llama3.2-1B-Instruct/
checklist.chk  consolidated.00.pth  params.json  tokenizer.model

ome-controller-manager logs

2025-08-06T18:19:25.184487143+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:116	Reconciling ClusterBaseModel	{"clusterbasemodel": {"name":"llama-3-2-1b-instruct"}}
2025-08-06T18:19:25.184496518+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:126	Adding finalizer to ClusterBaseModel	{"clusterbasemodel": {"name":"llama-3-2-1b-instruct"}}
2025-08-06T18:19:25.188545935+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:294	Processing model status from ConfigMaps	{"model": "llama-3-2-1b-instruct", "configMapsTotal": 0}
2025-08-06T18:19:25.188555428+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:366	Model status summary	{"model": "llama-3-2-1b-instruct", "readyNodes": 0, "failedNodes": 0, "totalProcessed": 0, "validNodes": 0}
2025-08-06T18:19:25.193595754+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:621	Updated ClusterBaseModel status	{"nodesReady": 0, "nodesFailed": 0, "state": "In_Transit"}
2025-08-06T18:19:25.193609615+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:116	Reconciling ClusterBaseModel	{"clusterbasemodel": {"name":"llama-3-2-1b-instruct"}}
2025-08-06T18:19:25.193611944+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:294	Processing model status from ConfigMaps	{"model": "llama-3-2-1b-instruct", "configMapsTotal": 0}
2025-08-06T18:19:25.193613949+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:366	Model status summary	{"model": "llama-3-2-1b-instruct", "readyNodes": 0, "failedNodes": 0, "totalProcessed": 0, "validNodes": 0}
2025-08-06T18:19:25.296159236+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:116	Reconciling ClusterBaseModel	{"clusterbasemodel": {"name":"llama-3-2-1b-instruct"}}
2025-08-06T18:19:25.296172817+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:294	Processing model status from ConfigMaps	{"model": "llama-3-2-1b-instruct", "configMapsTotal": 0}
2025-08-06T18:19:25.296175260+08:00 2025-08-06T10:19:25Z	INFO	ClusterBaseModel	basemodel/controller.go:366	Model status summary	{"model": "llama-3-2-1b-instruct", "readyNodes": 0, "failedNodes": 0, "totalProcessed": 0, "validNodes": 0}

ome-model-agent-daemonset logs

2025-08-06T17:05:32.737182634+08:00 2025-08-06T09:05:32.737Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:05:32.737203537+08:00 2025-08-06T09:05:32.737Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:10:32.736348254+08:00 2025-08-06T09:10:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:10:32.736370479+08:00 2025-08-06T09:10:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:15:32.736261904+08:00 2025-08-06T09:15:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:15:32.736296998+08:00 2025-08-06T09:15:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:20:32.736336190+08:00 2025-08-06T09:20:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:20:32.736358392+08:00 2025-08-06T09:20:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:25:32.737016463+08:00 2025-08-06T09:25:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:25:32.737039936+08:00 2025-08-06T09:25:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:30:32.736224203+08:00 2025-08-06T09:30:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:30:32.736247386+08:00 2025-08-06T09:30:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:35:32.737056996+08:00 2025-08-06T09:35:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:35:32.737081016+08:00 2025-08-06T09:35:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:40:32.736542056+08:00 2025-08-06T09:40:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:40:32.736565192+08:00 2025-08-06T09:40:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:45:32.736799727+08:00 2025-08-06T09:45:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:45:32.736823696+08:00 2025-08-06T09:45:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:50:32.736843340+08:00 2025-08-06T09:50:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:50:32.736866800+08:00 2025-08-06T09:50:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T17:55:32.737047884+08:00 2025-08-06T09:55:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T17:55:32.737070003+08:00 2025-08-06T09:55:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:00:32.736803383+08:00 2025-08-06T10:00:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:00:32.736825479+08:00 2025-08-06T10:00:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:05:32.736819589+08:00 2025-08-06T10:05:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:05:32.736840719+08:00 2025-08-06T10:05:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:10:32.736399272+08:00 2025-08-06T10:10:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:10:32.736422559+08:00 2025-08-06T10:10:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:15:32.736933866+08:00 2025-08-06T10:15:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:15:32.736963694+08:00 2025-08-06T10:15:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:19:25.188842029+08:00 2025-08-06T10:19:25.188Z	INFO	modelagent/scout.go:209	Processing ClusterBaseModel: llama-3-2-1b-instruct
2025-08-06T18:20:32.737077014+08:00 2025-08-06T10:20:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:20:32.737105731+08:00 2025-08-06T10:20:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:25:32.736536435+08:00 2025-08-06T10:25:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:25:32.736557762+08:00 2025-08-06T10:25:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:30:32.737317780+08:00 2025-08-06T10:30:32.737Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:30:32.737341969+08:00 2025-08-06T10:30:32.737Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:35:32.736548034+08:00 2025-08-06T10:35:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:35:32.736572578+08:00 2025-08-06T10:35:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:40:32.736435195+08:00 2025-08-06T10:40:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:40:32.736458866+08:00 2025-08-06T10:40:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap
2025-08-06T18:45:32.736627834+08:00 2025-08-06T10:45:32.736Z	WARN	modelagent/configmap_reconciler.go:148	ConfigMap not found during reconciliation, will recreate it
2025-08-06T18:45:32.736651967+08:00 2025-08-06T10:45:32.736Z	INFO	modelagent/configmap_reconciler.go:188	No models in cache to recreate ConfigMap

clusterbasemodel status

kubectl get clusterbasemodel
NAME                    DISABLED   VERSION   VENDOR   FRAMEWORK   FRAMEWORKVERSION   MODELFORMAT   ARCHITECTURE   CAPABILITIES   SIZE   COMPARTMENTID   READY        AGE
llama-3-2-1b-instruct   false      1.0.0     meta                                                                                                       In_Transit   29m

mupeifeiyi, Aug 06 '25 10:08

Resolved. I changed the ClusterBaseModel configuration as shown below. The controller still seems unable to recognize some model information automatically, so I set those fields explicitly. Based on the example config/runtimes/srt/llama-3-2-1b-instruct-rt.yaml, I then created a cluster runtime and was finally able to deploy the model.

Note: KEDA is also required for this part of the model deployment.

apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-2-1b-instruct
spec:
  displayName: llama-3.2-1b-instruct
  vendor: meta
  version: "3.2"
  disabled: false
  modelType: llama
  modelArchitecture: LlamaForCausalLM
  modelFormat:
    name: safetensors
    version: "1.0.0"
  modelFramework:
    name: transformers
    version: "4.45.0.dev0"
  storage:
    storageUri: "pvc://ome:pvc-llama-models/LLM-Research/Llama3.2-1B-Instruct"
    path: "/data/models/LLM-Research/Llama-3.2-1B-Instruct"

InferenceService yaml

apiVersion: v1
kind: Namespace
metadata:
  name: llama-1b-demo
---
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-2-1b-instruct
  namespace: llama-1b-demo
spec:
  model:
    name: llama-3-2-1b-instruct
  runtime:
    name: srt-llama-3-2-1b-instruct
  engine:
    minReplicas: 1
    maxReplicas: 1

mupeifeiyi, Aug 15 '25 02:08