[feature] Can we use one headless service for one job?

Open gaocegege opened this issue 6 years ago • 11 comments

A TFJob has ps/worker/chief replicas, and we currently create one headless service per replica. I think we could use a single headless service per job instead, which would be easier to use.

After that, we could use {tfjob_name}-{replica_type}-{index}.{service_name}.svc.cluster.local in the code.

WDYT @johnugeorge @richardsliu

gaocegege avatar Jun 21 '19 07:06 gaocegege
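
The naming scheme proposed above can be sketched in Go. This is only an illustration, not the operator's code: `podAddress` and all of its parameters are hypothetical names. The proposal writes the address as `{tfjob_name}-{replica_type}-{index}.{service_name}.svc.cluster.local`; Kubernetes DNS records for pods behind a headless service also carry the namespace between the service name and `svc`, so the sketch takes the namespace as an explicit argument.

```go
package main

import (
	"fmt"
	"strings"
)

// podAddress builds the DNS name of one replica behind a single, job-wide
// headless service. The function and its parameters are hypothetical and
// exist only to illustrate the proposal.
func podAddress(jobName, replicaType string, index int, service, namespace, clusterDomain string) string {
	// Replica types such as "Worker" are lowercased so the hostname is a valid DNS label.
	host := fmt.Sprintf("%s-%s-%d", jobName, strings.ToLower(replicaType), index)
	return fmt.Sprintf("%s.%s.%s.svc.%s", host, service, namespace, clusterDomain)
}

func main() {
	// Assumed example values: a job named "mnist" with a service of the same name.
	fmt.Println(podAddress("mnist", "Worker", 0, "mnist", "default", "cluster.local"))
	// prints "mnist-worker-0.mnist.default.svc.cluster.local"
}
```

With one service per job, every replica type shares the same `service` argument and only the hostname part changes.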

Issue-Label Bot is automatically applying the label improvement/enhancement to this issue, with a confidence of 0.70. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

issue-label-bot[bot] avatar Jun 21 '19 07:06 issue-label-bot[bot]

/area engprod
/priority p2

jtfogarty avatar Jan 14 '20 20:01 jtfogarty

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 24 '20 17:04 stale[bot]

/reopen
We should take this to improve cluster performance.

tenzen-y avatar Jul 18 '23 20:07 tenzen-y

@tenzen-y: Reopened this issue.

In response to this:

/reopen
We should take this to improve cluster performance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Jul 18 '23 20:07 google-oss-prow[bot]

I realized we need this after reading Aldo's comment.

cc: @kubeflow/wg-training-leads

tenzen-y avatar Jul 18 '23 20:07 tenzen-y

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 17 '23 00:10 github-actions[bot]

/lifecycle frozen

tenzen-y avatar Oct 17 '23 12:10 tenzen-y

@tenzen-y brought this up in brainstorming around jobset/kubeflow.

We have implemented a few ways to customize network names.

kannon92 avatar Apr 10 '24 12:04 kannon92

type Network struct {
	// EnableDNSHostnames allows pods to be reached via their hostnames.
	// Pods will be reachable using the fully qualified pod hostname:
	// <jobSet.name>-<spec.replicatedJob.name>-<job-index>-<pod-index>.<subdomain>
	// +optional
	EnableDNSHostnames *bool `json:"enableDNSHostnames,omitempty"`

	// Subdomain is an explicit choice for a network subdomain name
	// When set, any replicated job in the set is added to this network.
	// Defaults to <jobSet.name> if not set.
	// +optional
	Subdomain string `json:"subdomain,omitempty"`
}

This is what we used to control service creation for the JobSet.

kannon92 avatar Apr 10 '24 12:04 kannon92
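
As a rough illustration of the field comments in the struct above, the pod hostname pattern and the subdomain default could be expressed as follows. `podHostname` is a hypothetical helper written for this example, not a function from JobSet; it only applies the two rules stated in the comments (the `<jobSet.name>-<spec.replicatedJob.name>-<job-index>-<pod-index>.<subdomain>` pattern, and the subdomain falling back to the JobSet name).

```go
package main

import "fmt"

// podHostname is a hypothetical helper that follows the naming pattern
// described in the Network struct comments. It is not part of JobSet.
func podHostname(jobSetName, replicatedJob string, jobIndex, podIndex int, subdomain string) string {
	// Per the field comment, the subdomain defaults to the JobSet name when unset.
	if subdomain == "" {
		subdomain = jobSetName
	}
	return fmt.Sprintf("%s-%s-%d-%d.%s", jobSetName, replicatedJob, jobIndex, podIndex, subdomain)
}

func main() {
	// Assumed example values.
	fmt.Println(podHostname("train", "workers", 0, 1, ""))
	// prints "train-workers-0-1.train"
	fmt.Println(podHostname("train", "workers", 0, 1, "shared-net"))
	// prints "train-workers-0-1.shared-net"
}
```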

The suffix may differ from .svc.cluster.local depending on the cluster settings. Maybe we could add a CLI parameter to configure it.

gaocegege avatar Apr 11 '24 02:04 gaocegege
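
One way the CLI parameter suggested above might look, as a minimal sketch using Go's standard `flag` package. The flag name `cluster-domain`, the `parseClusterDomain` helper, and the `cluster.local` default are assumptions made for this example; none of them is an existing operator option.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseClusterDomain reads a hypothetical --cluster-domain flag from args.
// It falls back to "cluster.local" when the flag is not given.
func parseClusterDomain(args []string) (string, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	domain := fs.String("cluster-domain", "cluster.local", "DNS domain of the cluster")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *domain, nil
}

// serviceSuffix returns the suffix appended after the service name and namespace.
func serviceSuffix(clusterDomain string) string {
	return ".svc." + clusterDomain
}

func main() {
	domain, err := parseClusterDomain(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println(serviceSuffix(domain))
}
```

Running the sketch without arguments prints `.svc.cluster.local`; passing `--cluster-domain=example.org` would print `.svc.example.org`.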