[SUPPORT] ClusterBaseModel using pvc:// error

Open mupeifeiyi opened this issue 5 months ago • 4 comments

Question

ome agent pod logs:

# kubectl logs -f --tail 20 -n ome  ome-model-agent-daemonset-btbv2
2025-07-10T09:58:28.197Z	INFO	modelagent/scout.go:402	Deleting ClusterBaseModel: llama-3-2-1b-instruct
2025-07-10T09:58:28.242Z	INFO	modelagent/gopher.go:235	Processing gopher task: ClusterBaseModel llama-3-2-1b-instruct, type: Delete
2025-07-10T09:58:30.242Z	INFO	modelagent/node_label_reconciler.go:80	Processing node label Deleted operation for ClusterBaseModel llama-3-2-1b-instruct in state: Deleted
2025-07-10T09:58:30.244Z	INFO	modelagent/node_label_reconciler.go:110	Label models.ome.io/clusterbasemodel.llama-3-2-1b-instruct already removed from node k8s-master for ClusterBaseModel llama-3-2-1b-instruct - operation is idempotent
2025-07-10T09:58:30.244Z	INFO	modelagent/configmap_reconciler.go:502	Deleting model from ConfigMap: ClusterBaseModel llama-3-2-1b-instruct
2025-07-10T09:58:30.245Z	INFO	modelagent/configmap_reconciler.go:520	Model ClusterBaseModel llama-3-2-1b-instruct doesn't exist in ConfigMap, nothing to delete
2025-07-11T02:39:01.307Z	INFO	modelagent/scout.go:208	Processing ClusterBaseModel: llama-3-2-1b-instruct
2025-07-11T02:39:01.309Z	INFO	modelagent/scout.go:223	Downloading ClusterBaseModel: llama-3-2-1b-instruct
2025-07-11T02:39:01.396Z	INFO	modelagent/gopher.go:235	Processing gopher task: ClusterBaseModel llama-3-2-1b-instruct, type: Download
2025-07-11T02:39:01.396Z	INFO	modelagent/gopher.go:253	Setting model ClusterBaseModel llama-3-2-1b-instruct status to Updating before download
2025-07-11T02:39:01.396Z	INFO	modelagent/node_label_reconciler.go:80	Processing node label Updating operation for ClusterBaseModel llama-3-2-1b-instruct in state: Updating
2025-07-11T02:39:01.404Z	INFO	modelagent/node_label_reconciler.go:171	Successfully patched node k8s-master with Updating state for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T02:39:01.404Z	INFO	modelagent/configmap_reconciler.go:356	Reconciling model status in ConfigMap for ClusterBaseModel llama-3-2-1b-instruct with status: Updating
2025-07-11T02:39:01.405Z	INFO	modelagent/configmap_reconciler.go:740	Updating ConfigMap 'k8s-master' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T02:39:01.407Z	INFO	modelagent/configmap_reconciler.go:746	Successfully updated ConfigMap 'k8s-master' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T02:39:01.407Z	INFO	modelagent/configmap_reconciler.go:402	Successfully updated ConfigMap and cache for ClusterBaseModel llama-3-2-1b-instruct with status: Updating
2025-07-11T02:39:01.407Z	INFO	modelagent/gopher.go:302	Starting download for model ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T02:39:01.407Z	ERROR	modelagent/gopher.go:142	Gopher task failed with error: unknown storage type PVC
github.com/sgl-project/ome/pkg/modelagent.(*Gopher).runWorker
	/workspace/pkg/modelagent/gopher.go:142

I'm not sure how a ClusterBaseModel determines which namespace the PVC is in.

What did you try?

I tried creating a BaseModel in the same namespace as the PVC and got the same error.

Environment

  • OME version: main branch
  • Kubernetes version: 1.31
  • Model being served (if applicable):

Additional context

Install from source

# Clone the repository
git clone https://github.com/sgl-project/ome.git
cd ome

# Install from local charts
helm install ome-crd charts/ome-crd --namespace ome --create-namespace
helm install ome charts/ome-resources --namespace ome

I created the ClusterBaseModel from config/models/meta/Llama-3.2-1B-Instruct.yaml, with the following modifications:

apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-2-1b-instruct
spec:
  displayName: meta.llama-3.2-1b-instruct
  vendor: meta
  disabled: false
  version: "1.0.0"
  storage:
    storageUri: "pvc://pvc-llama-checkpoints/Llama3.2-1B-Instruct"
    path: "/local/models/llama-3.2-1b-instruct"
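The storageUri above has the shape pvc://<claim-name>/<sub-path> and carries no namespace, while a PersistentVolumeClaim is always namespaced and a ClusterBaseModel is not. The following is a minimal sketch of splitting such a URI, assuming exactly that two-part shape. The pvcRef type and parsePVCURI function are hypothetical names used for illustration; this is not OME's own parser.

```go
package main

import (
	"fmt"
	"strings"
)

// pvcRef is a hypothetical holder for the two parts of a pvc:// URI.
type pvcRef struct {
	ClaimName string // name of the PersistentVolumeClaim
	SubPath   string // path inside the volume; may be empty
}

// parsePVCURI splits "pvc://<claim-name>/<sub-path>" into its parts.
// Illustration only: it assumes the URI shape used in this issue and
// does not mirror OME's implementation.
func parsePVCURI(uri string) (pvcRef, error) {
	const scheme = "pvc://"
	if !strings.HasPrefix(uri, scheme) {
		return pvcRef{}, fmt.Errorf("not a pvc URI: %q", uri)
	}
	rest := strings.TrimPrefix(uri, scheme)
	// Everything before the first "/" is the claim name, the rest is the sub-path.
	claim, subPath, _ := strings.Cut(rest, "/")
	if claim == "" {
		return pvcRef{}, fmt.Errorf("missing claim name in %q", uri)
	}
	return pvcRef{ClaimName: claim, SubPath: subPath}, nil
}

func main() {
	ref, err := parsePVCURI("pvc://pvc-llama-checkpoints/Llama3.2-1B-Instruct")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("claim=%s subPath=%s\n", ref.ClaimName, ref.SubPath)
}
```

Nothing in the parsed result says which namespace holds the claim, so a cluster-scoped resource would need that information from somewhere else.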

PV and PVC manifests:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-llama-checkpoints
spec:
  capacity:
    storage: 20Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  hostPath:
    path: /root/.llama/checkpoints
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-llama-checkpoints
  namespace: ome
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
  storageClassName: local-storage

model path:

# ls /root/.llama/checkpoints/Llama3.2-1B-Instruct/
checklist.chk  consolidated.00.pth  params.json  tokenizer.model

ome container status:

# kubectl get po -n ome
NAME                                      READY   STATUS    RESTARTS   AGE
ome-controller-manager-7885447567-nqdx9   1/1     Running   0          2d16h
ome-model-agent-daemonset-btbv2           1/1     Running   0          2d16h
ome-model-agent-daemonset-xlwlm           1/1     Running   0          2d16h

mupeifeiyi avatar Jul 11 '25 03:07 mupeifeiyi

I don't think PV and PVC support for base models is properly implemented. I will address that. Thanks for raising this.

slin1237 avatar Jul 11 '25 03:07 slin1237

I found where the error is thrown and made some local modifications. So far I can only make sure that the PVC storage type skips the download step; I'm not sure what other logic needs to be handled afterwards. Below is the code I added locally in pkg/modelagent/gopher.go. After rebuilding the image, ome-model-agent no longer throws the error.

	case DownloadOverride:
		...
		case storage.StorageTypeVendor:
			s.logger.Infof("Skipping download for model %s", modelInfo)
		case storage.StorageTypePVC:
			// Added locally: skip the download for PVC-backed models, as for vendor models.
			s.logger.Infof("Skipping download for model %s", modelInfo)
		...
		default:
			return fmt.Errorf("unknown storage type %s", storageType)
		}
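The workaround above can be reduced to a self-contained sketch: a switch that treats PVC-backed models like vendor models and skips the download, while unhandled types still fall through to the "unknown storage type" error. The StorageType definition, its string values other than "PVC" (which appears in the error log), the StorageTypeRemote placeholder, and the needsDownload function are all assumptions made for illustration; they are not taken from OME's storage package.

```go
package main

import "fmt"

// StorageType is a hypothetical stand-in for the storage type used by the model agent.
type StorageType string

const (
	StorageTypeVendor StorageType = "VENDOR" // assumed value
	StorageTypePVC    StorageType = "PVC"    // matches the value printed in the error log
	StorageTypeRemote StorageType = "REMOTE" // hypothetical placeholder for any type that is downloaded
)

// needsDownload reports whether model files must be fetched for the given
// storage type. It returns an error for types it does not handle.
func needsDownload(t StorageType) (bool, error) {
	switch t {
	case StorageTypeVendor, StorageTypePVC:
		// Files are assumed to be in place already, so nothing is fetched.
		return false, nil
	case StorageTypeRemote:
		return true, nil
	default:
		return false, fmt.Errorf("unknown storage type %s", t)
	}
}

func main() {
	for _, t := range []StorageType{StorageTypePVC, StorageTypeRemote, "FTP"} {
		download, err := needsDownload(t)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("type=%s download=%t\n", t, download)
	}
}
```

Skipping the download only removes the error. Whether the agent should also verify that the claim exists or that the sub-path contains model files is left open here, as it is in the comment above.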

logs:

2025-07-11T07:55:41.391Z	INFO	modelagent/scout.go:208	Processing ClusterBaseModel: llama-3-2-1b-instruct
2025-07-11T07:55:41.393Z	INFO	modelagent/scout.go:223	Downloading ClusterBaseModel: llama-3-2-1b-instruct
2025-07-11T07:55:41.840Z	INFO	modelagent/gopher.go:235	Processing gopher task: ClusterBaseModel llama-3-2-1b-instruct, type: Download
2025-07-11T07:55:41.840Z	INFO	modelagent/gopher.go:253	Setting model ClusterBaseModel llama-3-2-1b-instruct status to Updating before download
2025-07-11T07:55:41.840Z	INFO	modelagent/node_label_reconciler.go:80	Processing node label Updating operation for ClusterBaseModel llama-3-2-1b-instruct in state: Updating
2025-07-11T07:55:41.848Z	INFO	modelagent/node_label_reconciler.go:171	Successfully patched node k8s-node01 with Updating state for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.848Z	INFO	modelagent/configmap_reconciler.go:356	Reconciling model status in ConfigMap for ClusterBaseModel llama-3-2-1b-instruct with status: Updating
2025-07-11T07:55:41.849Z	INFO	modelagent/configmap_reconciler.go:740	Updating ConfigMap 'k8s-node01' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z	INFO	modelagent/configmap_reconciler.go:746	Successfully updated ConfigMap 'k8s-node01' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z	INFO	modelagent/configmap_reconciler.go:402	Successfully updated ConfigMap and cache for ClusterBaseModel llama-3-2-1b-instruct with status: Updating
2025-07-11T07:55:41.852Z	INFO	modelagent/gopher.go:302	Starting download for model ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z	INFO	modelagent/gopher.go:360	Skipping download for model ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z	INFO	modelagent/gopher.go:382	Successfully downloaded ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z	INFO	modelagent/node_label_reconciler.go:80	Processing node label Ready operation for ClusterBaseModel llama-3-2-1b-instruct in state: Ready
2025-07-11T07:55:41.860Z	INFO	modelagent/node_label_reconciler.go:171	Successfully patched node k8s-node01 with Ready state for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.860Z	INFO	modelagent/configmap_reconciler.go:356	Reconciling model status in ConfigMap for ClusterBaseModel llama-3-2-1b-instruct with status: Ready
2025-07-11T07:55:41.861Z	INFO	modelagent/configmap_reconciler.go:740	Updating ConfigMap 'k8s-node01' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.863Z	INFO	modelagent/configmap_reconciler.go:746	Successfully updated ConfigMap 'k8s-node01' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.863Z	INFO	modelagent/configmap_reconciler.go:402	Successfully updated ConfigMap and cache for ClusterBaseModel llama-3-2-1b-instruct with status: Ready

clusterbasemodel:

# kubectl get clusterbasemodels.ome.io 
NAME                    DISABLED   VERSION   VENDOR   FRAMEWORK   FRAMEWORKVERSION   MODELFORMAT   ARCHITECTURE   CAPABILITIES   SIZE   COMPARTMENTID   READY   AGE
llama-3-2-1b-instruct   false      1.0.0     meta                                                                                                       Ready   14m

According to the documentation, some of these fields should be parsed automatically, but they do not appear to be populated here. I may need your guidance on this. I am running it locally.

mupeifeiyi avatar Jul 11 '25 08:07 mupeifeiyi

https://docs.sglang.ai/ome/docs/concepts/base_model/#automatic-model-discovery As described at this link, the 1B model I downloaded does not include a config.json file.

mupeifeiyi avatar Jul 11 '25 08:07 mupeifeiyi

Sorry, I see now. I downloaded the model from Meta rather than from Hugging Face, so there is no config.json. I'm going to try downloading it from Hugging Face instead.

mupeifeiyi avatar Jul 11 '25 08:07 mupeifeiyi