[SUPPORT] ClusterBaseModel using pvc:// error
Question
OME model agent pod logs:
# kubectl logs -f --tail 20 -n ome ome-model-agent-daemonset-btbv2
2025-07-10T09:58:28.197Z INFO modelagent/scout.go:402 Deleting ClusterBaseModel: llama-3-2-1b-instruct
2025-07-10T09:58:28.242Z INFO modelagent/gopher.go:235 Processing gopher task: ClusterBaseModel llama-3-2-1b-instruct, type: Delete
2025-07-10T09:58:30.242Z INFO modelagent/node_label_reconciler.go:80 Processing node label Deleted operation for ClusterBaseModel llama-3-2-1b-instruct in state: Deleted
2025-07-10T09:58:30.244Z INFO modelagent/node_label_reconciler.go:110 Label models.ome.io/clusterbasemodel.llama-3-2-1b-instruct already removed from node k8s-master for ClusterBaseModel llama-3-2-1b-instruct - operation is idempotent
2025-07-10T09:58:30.244Z INFO modelagent/configmap_reconciler.go:502 Deleting model from ConfigMap: ClusterBaseModel llama-3-2-1b-instruct
2025-07-10T09:58:30.245Z INFO modelagent/configmap_reconciler.go:520 Model ClusterBaseModel llama-3-2-1b-instruct doesn't exist in ConfigMap, nothing to delete
2025-07-11T02:39:01.307Z INFO modelagent/scout.go:208 Processing ClusterBaseModel: llama-3-2-1b-instruct
2025-07-11T02:39:01.309Z INFO modelagent/scout.go:223 Downloading ClusterBaseModel: llama-3-2-1b-instruct
2025-07-11T02:39:01.396Z INFO modelagent/gopher.go:235 Processing gopher task: ClusterBaseModel llama-3-2-1b-instruct, type: Download
2025-07-11T02:39:01.396Z INFO modelagent/gopher.go:253 Setting model ClusterBaseModel llama-3-2-1b-instruct status to Updating before download
2025-07-11T02:39:01.396Z INFO modelagent/node_label_reconciler.go:80 Processing node label Updating operation for ClusterBaseModel llama-3-2-1b-instruct in state: Updating
2025-07-11T02:39:01.404Z INFO modelagent/node_label_reconciler.go:171 Successfully patched node k8s-master with Updating state for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T02:39:01.404Z INFO modelagent/configmap_reconciler.go:356 Reconciling model status in ConfigMap for ClusterBaseModel llama-3-2-1b-instruct with status: Updating
2025-07-11T02:39:01.405Z INFO modelagent/configmap_reconciler.go:740 Updating ConfigMap 'k8s-master' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T02:39:01.407Z INFO modelagent/configmap_reconciler.go:746 Successfully updated ConfigMap 'k8s-master' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T02:39:01.407Z INFO modelagent/configmap_reconciler.go:402 Successfully updated ConfigMap and cache for ClusterBaseModel llama-3-2-1b-instruct with status: Updating
2025-07-11T02:39:01.407Z INFO modelagent/gopher.go:302 Starting download for model ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T02:39:01.407Z ERROR modelagent/gopher.go:142 Gopher task failed with error: unknown storage type PVC
github.com/sgl-project/ome/pkg/modelagent.(*Gopher).runWorker
/workspace/pkg/modelagent/gopher.go:142
I'm not sure how a ClusterBaseModel identifies the namespace of the PVC.
What did you try?
I tried creating a BaseModel in the same namespace as the PVC and got the same error.
Environment
- OME version: main branch
- Kubernetes version: 1.31
- Model being served (if applicable):
Additional context
Installed from source:
# Clone the repository
git clone https://github.com/sgl-project/ome.git
cd ome
# Install from local charts
helm install ome-crd charts/ome-crd --namespace ome --create-namespace
helm install ome charts/ome-resources --namespace ome
I created the ClusterBaseModel from config/models/meta/Llama-3.2-1B-Instruct.yaml with the following modifications:
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-2-1b-instruct
spec:
  displayName: meta.llama-3.2-1b-instruct
  vendor: meta
  disabled: false
  version: "1.0.0"
  storage:
    storageUri: "pvc://pvc-llama-checkpoints/Llama3.2-1B-Instruct"
    path: "/local/models/llama-3.2-1b-instruct"
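As written, the storageUri above carries only a PVC name and a sub-path, with no namespace component, which is what makes the namespace question come up for a resource like ClusterBaseModel. The following is a minimal Go sketch of splitting such a URI into its parts. The `pvcRef` type and `parsePVCURI` function are hypothetical names for illustration and are not taken from the OME code base; how OME actually resolves the PVC namespace is not confirmed by this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// pvcRef is a hypothetical holder for the parts of a pvc:// URI.
type pvcRef struct {
	Name    string // PVC name, e.g. "pvc-llama-checkpoints"
	SubPath string // path inside the volume, e.g. "Llama3.2-1B-Instruct"
}

// parsePVCURI splits "pvc://<name>/<sub-path>" into its parts.
// Illustrative helper only; it is not OME's parser.
func parsePVCURI(uri string) (pvcRef, error) {
	const prefix = "pvc://"
	if !strings.HasPrefix(uri, prefix) {
		return pvcRef{}, fmt.Errorf("not a pvc:// URI: %q", uri)
	}
	rest := strings.TrimPrefix(uri, prefix)
	// Everything before the first "/" is the PVC name; the rest is the sub-path.
	name, subPath, _ := strings.Cut(rest, "/")
	if name == "" {
		return pvcRef{}, fmt.Errorf("missing PVC name in %q", uri)
	}
	return pvcRef{Name: name, SubPath: subPath}, nil
}

func main() {
	ref, err := parsePVCURI("pvc://pvc-llama-checkpoints/Llama3.2-1B-Instruct")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("pvc=%s subPath=%s\n", ref.Name, ref.SubPath)
}
```

Nothing in the parsed result says which namespace the claim lives in, so that information would have to come from somewhere else.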
PV and PVC manifest:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-llama-checkpoints
spec:
  capacity:
    storage: 20Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  hostPath:
    path: /root/.llama/checkpoints
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-llama-checkpoints
  namespace: ome
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
  storageClassName: local-storage
model path:
# ls /root/.llama/checkpoints/Llama3.2-1B-Instruct/
checklist.chk consolidated.00.pth params.json tokenizer.model
ome container status:
# kubectl get po -n ome
NAME                                      READY   STATUS    RESTARTS   AGE
ome-controller-manager-7885447567-nqdx9   1/1     Running   0          2d16h
ome-model-agent-daemonset-btbv2           1/1     Running   0          2d16h
ome-model-agent-daemonset-xlwlm           1/1     Running   0          2d16h
I don't think PV and PVC support for base models is properly implemented. I will address that. Thanks for raising this.
I found where the error is raised and made a local modification. All I can confirm is that the PVC storage type does not need a download step; I am not sure what other logic needs to handle it afterwards. Below is the code I added locally. After rebuilding the image, ome-model-agent no longer returns the error.
pkg/modelagent/gopher.go
case DownloadOverride:
    ...
    case storage.StorageTypeVendor:
        s.logger.Infof("Skipping download for model %s", modelInfo)
    case storage.StorageTypePVC:
        s.logger.Infof("Skipping download for model %s", modelInfo)
    ...
    default:
        return fmt.Errorf("unknown storage type %s", storageType)
    }
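The idea behind the local patch can be shown in isolation: a storage type whose files are already on a mounted volume is treated as "nothing to download" instead of falling through to the default error branch. This is a self-contained sketch, not the code in pkg/modelagent/gopher.go. The `StorageType` type, the constant values other than "PVC" (which matches the logged error text), the `StorageTypeRemote` placeholder, and `needsDownload` are all assumptions made for illustration.

```go
package main

import "fmt"

// StorageType is a stand-in for the agent's storage type enum (assumed shape).
type StorageType string

const (
	StorageTypeRemote StorageType = "REMOTE" // hypothetical: any type that requires fetching files
	StorageTypeVendor StorageType = "VENDOR" // assumed value
	StorageTypePVC    StorageType = "PVC"    // value matches "unknown storage type PVC" in the log
)

// needsDownload reports whether model files must be fetched for the given type.
// Vendor and PVC models are treated as already present, so no download is needed.
func needsDownload(t StorageType) (bool, error) {
	switch t {
	case StorageTypeRemote:
		return true, nil
	case StorageTypeVendor, StorageTypePVC:
		return false, nil
	default:
		return false, fmt.Errorf("unknown storage type %s", t)
	}
}

func main() {
	for _, t := range []StorageType{StorageTypeRemote, StorageTypePVC, "NFS"} {
		download, err := needsDownload(t)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: download=%v\n", t, download)
	}
}
```

Skipping the download only removes the error. Whether the agent then needs to verify the PVC contents or parse model metadata from it is a separate question that this sketch does not address.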
logs:
2025-07-11T07:55:41.391Z INFO modelagent/scout.go:208 Processing ClusterBaseModel: llama-3-2-1b-instruct
2025-07-11T07:55:41.393Z INFO modelagent/scout.go:223 Downloading ClusterBaseModel: llama-3-2-1b-instruct
2025-07-11T07:55:41.840Z INFO modelagent/gopher.go:235 Processing gopher task: ClusterBaseModel llama-3-2-1b-instruct, type: Download
2025-07-11T07:55:41.840Z INFO modelagent/gopher.go:253 Setting model ClusterBaseModel llama-3-2-1b-instruct status to Updating before download
2025-07-11T07:55:41.840Z INFO modelagent/node_label_reconciler.go:80 Processing node label Updating operation for ClusterBaseModel llama-3-2-1b-instruct in state: Updating
2025-07-11T07:55:41.848Z INFO modelagent/node_label_reconciler.go:171 Successfully patched node k8s-node01 with Updating state for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.848Z INFO modelagent/configmap_reconciler.go:356 Reconciling model status in ConfigMap for ClusterBaseModel llama-3-2-1b-instruct with status: Updating
2025-07-11T07:55:41.849Z INFO modelagent/configmap_reconciler.go:740 Updating ConfigMap 'k8s-node01' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z INFO modelagent/configmap_reconciler.go:746 Successfully updated ConfigMap 'k8s-node01' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z INFO modelagent/configmap_reconciler.go:402 Successfully updated ConfigMap and cache for ClusterBaseModel llama-3-2-1b-instruct with status: Updating
2025-07-11T07:55:41.852Z INFO modelagent/gopher.go:302 Starting download for model ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z INFO modelagent/gopher.go:360 Skipping download for model ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z INFO modelagent/gopher.go:382 Successfully downloaded ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.852Z INFO modelagent/node_label_reconciler.go:80 Processing node label Ready operation for ClusterBaseModel llama-3-2-1b-instruct in state: Ready
2025-07-11T07:55:41.860Z INFO modelagent/node_label_reconciler.go:171 Successfully patched node k8s-node01 with Ready state for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.860Z INFO modelagent/configmap_reconciler.go:356 Reconciling model status in ConfigMap for ClusterBaseModel llama-3-2-1b-instruct with status: Ready
2025-07-11T07:55:41.861Z INFO modelagent/configmap_reconciler.go:740 Updating ConfigMap 'k8s-node01' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.863Z INFO modelagent/configmap_reconciler.go:746 Successfully updated ConfigMap 'k8s-node01' in namespace 'ome' for ClusterBaseModel llama-3-2-1b-instruct
2025-07-11T07:55:41.863Z INFO modelagent/configmap_reconciler.go:402 Successfully updated ConfigMap and cache for ClusterBaseModel llama-3-2-1b-instruct with status: Ready
clusterbasemodel:
# kubectl get clusterbasemodels.ome.io
NAME                    DISABLED   VERSION   VENDOR   FRAMEWORK   FRAMEWORKVERSION   MODELFORMAT   ARCHITECTURE   CAPABILITIES   SIZE   COMPARTMENTID   READY   AGE
llama-3-2-1b-instruct   false      1.0.0     meta                                                                                                       Ready   14m
According to the documentation, some of these fields should be parsed automatically, but they are not populated here. I may need your guidance. Try running it locally.
https://docs.sglang.ai/ome/docs/concepts/base_model/#automatic-model-discovery As the link mentions, automatic discovery relies on config.json, and the 1B model I downloaded does not have a config.json file.
Sorry, I see now. I downloaded the model from Meta rather than from Hugging Face, so there is no config.json. I'm going to try downloading it from Hugging Face.
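A quick way to confirm this before creating the model resource is to check the directory for config.json. The sketch below does only that; the `hasConfigJSON` helper is a hypothetical name, and the directory is the host path from this issue. It does not reproduce OME's discovery logic, which may look at more than this one file.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasConfigJSON reports whether dir contains a regular file named config.json.
// Hypothetical helper for illustration; not part of OME.
func hasConfigJSON(dir string) bool {
	info, err := os.Stat(filepath.Join(dir, "config.json"))
	return err == nil && info.Mode().IsRegular()
}

func main() {
	// Host path used in this issue; adjust for your own checkpoint location.
	dir := "/root/.llama/checkpoints/Llama3.2-1B-Instruct"
	if hasConfigJSON(dir) {
		fmt.Println("config.json found in", dir)
	} else {
		fmt.Println("config.json missing in", dir)
	}
}
```

For the Meta-format checkpoint listed earlier (checklist.chk, consolidated.00.pth, params.json, tokenizer.model), this check reports the file as missing.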