prefect icon indicating copy to clipboard operation
prefect copied to clipboard

Add VolcanoWorker to support Volcano Job submission via Kubernetes work pool and test

Open zyn71 opened this issue 8 months ago • 14 comments

Apologies for the noise on the earlier PR https://github.com/PrefectHQ/prefect/pull/17880. I am sry While iterating on the Volcano worker I accidentally pushed several unrelated commits that polluted the diff. I’ve started fresh from the latest upstream/main, cherry-picked only the relevant changes, and added a focused unit-test suite.

The previous PR has been closed in favor of this clean one. Changes included here (3 files):

volcanoworker.py – new VolcanoWorker implementation

init.py – registers the new worker type

tests/test_volcano_worker.py – unit & smoke tests covering job creation, pod discovery, and state polling

Please let me know if you’d like additional work

zyn71 avatar Apr 25 '25 09:04 zyn71

CodSpeed Performance Report

Merging #17909 will not alter performance

Comparing zyn71:feat/volcano_clean (5c10a28) with main (df10951)

Summary

✅ 2 untouched benchmarks

codspeed-hq[bot] avatar Apr 28 '25 08:04 codspeed-hq[bot]

hi,@reneleonhardt @zzstoatzz (we have argued before) Sorry to nudge, but I wanted to gently follow-up on this PR. It’s been open for three days, the checks (CI, pre-commit, dev env) have all passed, and the diff is limited to three files.And it has already been applied into out porduction. If there’s anything I can clarify or adjust, just let me know—I’m happy to turn it around quickly.Thanks a lot for your time and for keeping the review queue moving!

zyn71 avatar Apr 28 '25 09:04 zyn71

I'm no maintainer sorry. I suggest using a free AI to save contributors and reviewers time: https://www.coderabbit.ai/blog/how-linux-foundation-used-ai-code-reviews-to-reduce-manual-bottlenecks-in-oss

reneleonhardt avatar Apr 28 '25 10:04 reneleonhardt

Hey @zyn71, neither @zzstoatzz nor I are very familiar with Volcano, and we'll want to test this functionality before merging. Could you provide instructions on how we can test these changes locally?

desertaxle avatar Apr 29 '25 13:04 desertaxle

Hi @desertaxle — thank you for the quick feedback! Below is a full, single-file guide my team follows to test the new VolcanoWorker locally. Everything has been validated on real clusters;


1 Get server and workpool

Create a conventional prefect server and prefect-kubernetes workpool.So far nothing is Volcano-specific.


2 Convert K8s work pool to “Volcano mode” manully

Open K8s Work Pool → Advanced → Job Manifest and paste the manifest below.
It’s the normal template with apiVersion: batch.volcano.sh/v1alpha1 and Volcano-only fields added:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  labels: "{{ labels }}"
  namespace: "{{ namespace }}"
  generateName: "{{ name }}-"
spec:
  schedulerName: volcano
  queue: your-volcano-queue
  maxRetry: 3
  minAvailable: 1
  tasks:
    - name: pod1
      replicas: 1
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        metadata:
          labels:
            job-name: "{{ name }}"
            volcano.sh/job-name: "{{ name }}"
        spec:
          restartPolicy: Never
          containers:
            - name: c1
              image: "{{ image }}"
              command: "{{ command }}"
              env: "{{ env }}"
              imagePullPolicy: "{{ image_pull_policy }}"

3 Start a VolcanoWorker

Before merging into main branch, we must build a new image including above there changes files.(BASE is the newest version prefect)

docker build -t your_image .

We can uses helm chart to create a volcanoworker

helm pull prefect/prefect-worker          # fetch chart
tar -xzf prefect-worker-*.tgz && cd prefect-worker

# edit values.yaml (connect to prefect server by your owns)➜
config:
    type: volcano
    workPool:"your work pool"
image:
    repository: your image
    prefectTag: your tag

helm install prefect-worker . -f values.yaml -n your-namespace

No extra image build is needed after merging


4 Deploy & run a flow

Any deployment assigned to my-k8s-pool now spawns a real Volcano Job.
Logs, exit-codes, and state reporting behave exactly like the stock K8s worker.nothing is Volcano-specific


Extra notes

  • Zero intrusion: VolcanoWorker subclasses KubernetesWorker; normal paths are untouched unless apiVersion is Volcano.
  • Tests: added tests/test_volcanoworker.py covering job creation, pod discovery, and state-watching.
  • Happy to iterate further — just let me know what else you need!

Thanks again for your time!

zyn71 avatar Apr 30 '25 06:04 zyn71

Thanks for that guide @zyn71! Do I need to install Volcano in my cluster before using the Volcano worker? Also, I think it'd be best if this worker didn't require manual changes to the base job template in the default case. Would you be able to add _get_default_job_manifest_template similar to the existing KubernetesWorker?

desertaxle avatar May 01 '25 14:05 desertaxle

hi again @desertaxle !Yes, Volcano must be installed in the cluster before the VolcanoWorker can be used.
Because this PR has not been merged yet, you also need to build a custom worker image (step 3 in my guide): simply copy volcanoworker.py and the updated __init__.py into the current Prefect package, build the image, and then start the worker with --type volcano.

Regarding your concern about avoiding manual edits to the base job template: Prefect validates the job manifest in several layers, not only inside the worker. Making the worker auto‑switch templates would therefore require extensive changes—perhaps even a brand‑new integration. Considering that our goal is just to extend prefect‑kubernetes with Volcano support, and since VolcanoWorker already inherits from KubernetesWorker, the current solution remains flexible and easy to use.

If there is anything else you need, please let me know. Thank you!

zyn71 avatar May 02 '25 04:05 zyn71

@zyn71 I disagree that it would take extensive updates to support a different default manifest for the Volcano worker. The primary validation of the manifest happens in KubernetesWorkerJobConfiguration._validate_job_manifest, which you are already overriding in this PR. I think it makes more sense to have that validation be Volcano-specific and have a default manifest that allows the Volcano worker to be used without manual configuration.

I think one thing that would help is that the Volcano worker doesn't need to be compatible with vanilla Kubernetes jobs. I think the overall design of the new worker would be greatly improved if you dropped that compatibility.

desertaxle avatar May 02 '25 16:05 desertaxle

hi @desertaxle !I totally understand your point—the first thing I tried was exactly what you’re suggesting.
However, the default manifest is injected at the work‑pool level, not at the worker level.
Right now Prefect only ships a single work‑pool type (kubernetes). When you create a Kubernetes work pool, the UI automatically seeds a Kubernetes Job template; later, the worker validates whatever manifest lives in that pool.

If we want to eliminate all manual edits, we would first need to introduce a brand‑new Volcano work‑pool type (or extend the existing pool so it can return a Volcano‑flavored template). That’s only part of the story, though:

  • During local testing I noticed that validation happens in several places.
    In addition to the worker’s JobConfiguration._validate_job_manifest, the flow‑run machinery also validates the manifest that comes from the work pool, and other internal modules touch it as well.
  • So even with a Volcano work pool, we’d still have to propagate changes across multiple code paths, which feels like a fairly invasive refactor.

Because Volcano can sit on top of standard Kubernetes primitives, my goal was to reuse the existing kubernetes work‑pool infrastructure and just add a lightweight VolcanoWorker on top. That avoids a large amount of churn in the core codebase while still giving users an extra scheduling option.

zyn71 avatar May 05 '25 00:05 zyn71

@zyn71 we have processes in place to add new supported work pools to the UI, so it will be possible to create a Volcano worker from the UI once the Volcano worker is added to prefect-kubernetes. In the meantime, workers can create work pools when they start up, so running prefect worker start --type volcano WORK_POOL_NAME can create a Volcano work pool. That's why I'm insistent that the Volcano worker should have its own default manifest.

I appreciate your desire not to introduce new churn to the codebase, but I don't think you'll introduce much churn since you're adding something new. If the Volcano worker is going to share code with the Kubernetes worker, I'd like them to be as loosely coupled as possible. One way you can do that is by ensuring the Volcano worker is only responsible for running Volcano jobs.

desertaxle avatar May 05 '25 13:05 desertaxle

Hi @desertaxle! Thanks a lot for the suggestions and guidance. I’ve now added a _get_default_volcano_job_manifest_template() to VolcanoWorker, so we have anative Volcano base manifest.

Following your advice, once the UI can create a Volcano work‑pool (with itsown default job manifest), this worker will be much more self‑contained andloosely coupled with the existing Kubernetes worker.

Could you take a look and let me know if this approach works for you? I’m happy to make further adjustments if needed. Thanks!

zyn71 avatar May 07 '25 06:05 zyn71

Hey @zyn71, we made some pretty significant changes to the Kubernetes worker in https://github.com/PrefectHQ/prefect/pull/18004 that will require changes to the worker you're working on here.

The big change is splitting event emission and crash detection responsibilities into a separate observer that starts alongside the worker. You might be able to reuse the observer, but I'm not sure if the jobs event handler will work with Volanco jobs. Let me know if you have any questions on the new structure!

desertaxle avatar May 09 '25 14:05 desertaxle

hi! @desertaxle,Sorry for the delayed response - I've been quite busy lately and this took longer than expected. First of all, I want to say that the refactoring work you did in #18004 is excellent! It addresses many issues and significantly improves the reliability and architecture of the Kubernetes worker. I've successfully adapted my VolcanoWorker to work with the new structure. The good news is that I was able to reuse your observer implementation without any modifications - it works perfectly with Volcano jobs! Here's what I did to make it compatible:

  1. Removed old monitoring logic: Eliminated the custom _watch_job(), _get_job(), and event replication code since the observer now handles all of this.
  2. Simplified the run() method: Now it only creates the job and returns immediately, letting the observer handle monitoring.
  3. Fixed Pod label propagation: This was the key part - I updated the _propagate_labels_to_pod() method to correctly handle Volcano job's pod template structure (spec.tasks[0].template.metadata.labels instead of spec.template.metadata.labels), ensuring that prefect.io/flow-run-id labels are properly propagated to pods so the observer can identify them.
  4. Updated job manifest template: Made sure the Volcano job template includes {{ labels }} placeholders in the pod template. I've tested the implementation and confirmed that:
  • The observer correctly identifies and monitors Volcano job pods
  • Event replication works as expected
  • Crash detection functions properly
  • Both standard Kubernetes jobs and Volcano jobs work seamlessly Thanks again for the great work on the refactoring - it makes the codebase much cleaner and more robust!

zyn71 avatar May 28 '25 01:05 zyn71

Hi! Our MR hasn’t made any progress in a week. Do you have any suggestions, or is there anything else I can do to move it forward or improve it?

zyn71 avatar Jun 04 '25 02:06 zyn71

hi @zyn71 - sorry for the delay here. desertaxle is out on leave right now. one of us will review your updates as soon as possible

zzstoatzz avatar Jun 17 '25 15:06 zzstoatzz

This pull request is stale because it has been open 14 days with no activity. To keep this pull request open remove stale label or comment.

github-actions[bot] avatar Jul 01 '25 16:07 github-actions[bot]

This pull request was closed because it has been stale for 14 days with no activity. If this pull request is important or you have more to add feel free to re-open it.

github-actions[bot] avatar Jul 15 '25 17:07 github-actions[bot]

hi,its been a long time.can i do sth to keep the pr forward?

zyn71 avatar Jul 22 '25 09:07 zyn71