How does the tidb-operator deploy a pd cluster?

Open Phoenix500526 opened this issue 1 year ago • 3 comments


Hi everyone, I've been studying the tidb-operator code recently and I've encountered some issues. Even after reading related blogs, I'm still confused.

When deploying a pd cluster, the pdMemberManager will add mounted data volumes and mount paths to the corresponding StatefulSet template as follows:

func (pmm *pdMemberManager) getNewPDSetForTidbCluster(tc *v1alpha1.TidbCluster) (*apps.StatefulSet, error) {
        // ... (definitions of pdConfigMap, pdLabel, podAnnotations and failureReplicas omitted in this excerpt)
        volMounts := []corev1.VolumeMount{
                {Name: "config", ReadOnly: true, MountPath: "/etc/pd"},
                {Name: "startup-script", ReadOnly: true, MountPath: "/usr/local/bin"},
                {Name: v1alpha1.PDMemberType.String(), MountPath: "/var/lib/pd"},
        }
        vols := []corev1.Volume{
                {Name: "config",
                        VolumeSource: corev1.VolumeSource{
                                ConfigMap: &corev1.ConfigMapVolumeSource{
                                        LocalObjectReference: corev1.LocalObjectReference{
                                                Name: pdConfigMap,
                                        },
                                        Items: []corev1.KeyToPath{{Key: "config-file", Path: "pd.toml"}},
                                },
                        },
                },
                {Name: "startup-script",
                        VolumeSource: corev1.VolumeSource{
                                ConfigMap: &corev1.ConfigMapVolumeSource{
                                        LocalObjectReference: corev1.LocalObjectReference{
                                                Name: pdConfigMap,
                                        },
                                        Items: []corev1.KeyToPath{{Key: "startup-script", Path: "pd_start_script.sh"}},
                                },
                        },
                },
        }

        pdSet := &apps.StatefulSet{
                // ... (object metadata omitted in this excerpt)
                Spec: apps.StatefulSetSpec{
                        Replicas: func() *int32 { r := tc.Spec.PD.Replicas + int32(failureReplicas); return &r }(),
                        Selector: pdLabel.LabelSelector(),
                        Template: corev1.PodTemplateSpec{
                                ObjectMeta: metav1.ObjectMeta{
                                        Labels:      pdLabel.Labels(),
                                        Annotations: podAnnotations,
                                },
                                Spec: corev1.PodSpec{
                                        SchedulerName: tc.Spec.SchedulerName,
                                        Affinity:      tc.Spec.PD.Affinity,
                                        NodeSelector:  tc.Spec.PD.NodeSelector,
                                        Containers: []corev1.Container{
                                                {
                                                        Image:           tc.Spec.PD.Image,
                                                        Command:         []string{"/bin/sh", "/usr/local/bin/pd_start_script.sh"},
                                                        ImagePullPolicy: tc.Spec.PD.ImagePullPolicy,
                                                        Ports: []corev1.ContainerPort{
                                                                {
                                                                        Name:          "server",
                                                                        ContainerPort: int32(2380),
                                                                        Protocol:      corev1.ProtocolTCP,
                                                                },
                                                                {
                                                                        Name:          "client",
                                                                        ContainerPort: int32(2379),
                                                                        Protocol:      corev1.ProtocolTCP,
                                                                },
                                                        },
                                                        VolumeMounts: volMounts,
                                                        // ... (remaining container fields omitted)
                                                },
                                        },
                                        // The volumes declared above are attached to the pod here.
                                        Volumes: vols,
                                },
                        },
                        // ... (remaining StatefulSet fields omitted)
                },
        }
        return pdSet, nil
}

When a PD pod starts, it first executes /usr/local/bin/pd_start_script.sh, as follows:


# This script is used to start pd containers in kubernetes cluster

# Use DownwardAPIVolumeFiles to store informations of the cluster:
# https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#the-downward-api
#   runmode="normal/debug"

set -uo pipefail


if [[ ! -f "${ANNOTATIONS}" ]]
then
    echo "${ANNOTATIONS} doesn't exist, exiting."
    exit 1
fi
source ${ANNOTATIONS} 2>/dev/null

if [[ X${runmode} == Xdebug ]]
then
    echo "entering debug mode."
    tail -f /dev/null
fi

# the general form of variable PEER_SERVICE_NAME is: "<clusterName>-pd-peer"
cluster_name=`echo ${PEER_SERVICE_NAME} | sed 's/-pd-peer//'`
encoded_domain_url=`echo ${domain}:2380 | base64 | tr "\n" " " | sed "s/ //g"`

while true; do
    sleep ${period}
    elapseTime=$(( elapseTime+period ))

    if [[ ${elapseTime} -ge ${threshold} ]]
    then
        echo "waiting for pd cluster ready timeout" >&2
        exit 1
    fi

    if nslookup ${domain} 2>/dev/null
    then
        echo "nslookup domain ${domain}.svc success"
        break
    else
        echo "nslookup domain ${domain} failed" >&2
    fi
done

# The content of /etc/pd/pd.toml is as follows:
#    [log]
#    level = "info"
#    [replication]
#    location-labels = ["region", "zone", "rack", "host"]
ARGS="--data-dir=/var/lib/pd \
--name=${HOSTNAME} \
--peer-urls= \
--advertise-peer-urls=http://${domain}:2380 \
--client-urls= \
--advertise-client-urls=http://${domain}:2379 \
--config=/etc/pd/pd.toml \
"

if [[ -f /var/lib/pd/join ]]
then
    # The content of the join file is:
    #   demo-pd-0=http://demo-pd-0.demo-pd-peer.demo.svc:2380,demo-pd-1=http://demo-pd-1.demo-pd-peer.demo.svc:2380
    # The --join args must be:
    #   --join=http://demo-pd-0.demo-pd-peer.demo.svc:2380,http://demo-pd-1.demo-pd-peer.demo.svc:2380
    join=`cat /var/lib/pd/join | tr "," "\n" | awk -F'=' '{print $2}' | tr "\n" ","`
    ARGS="${ARGS} --join=${join}"
elif [[ ! -d /var/lib/pd/member/wal ]]
then
    until result=$(wget -qO- -T 3 http://${discovery_url}/new/${encoded_domain_url} 2>/dev/null); do
        echo "waiting for discovery service to return start args ..."
        sleep $((RANDOM % 5))
    done
    # append the start args returned by the discovery service
    ARGS="${ARGS}${result}"
fi

echo "starting pd-server ..."
sleep $((RANDOM % 10))
echo "/pd-server ${ARGS}"
exec /pd-server ${ARGS}
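
To make the join-file branch of the script easier to follow, here is a minimal Go sketch of the same conversion from "name=url,name=url" to the URL list passed to --join. The helper name joinArgFromFile is hypothetical and is not part of tidb-operator or PD. As far as I can tell, the shell pipeline leaves a trailing comma after the last URL, which this sketch does not reproduce.

```go
package main

import (
	"fmt"
	"strings"
)

// joinArgFromFile turns join-file content such as
// "pd-0=http://pd-0:2380,pd-1=http://pd-1:2380" into "http://pd-0:2380,http://pd-1:2380".
// Hypothetical helper for illustration only.
func joinArgFromFile(content string) string {
	var urls []string
	for _, entry := range strings.Split(strings.TrimSpace(content), ",") {
		// Keep the part after the first "=", like awk -F'=' '{print $2}' does for these URLs.
		if _, url, ok := strings.Cut(entry, "="); ok && url != "" {
			urls = append(urls, url)
		}
	}
	return strings.Join(urls, ",")
}

func main() {
	content := "demo-pd-0=http://demo-pd-0.demo-pd-peer.demo.svc:2380," +
		"demo-pd-1=http://demo-pd-1.demo-pd-peer.demo.svc:2380\n"
	fmt.Println("--join=" + joinArgFromFile(content))
}
```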

My questions are as follows:

  1. Does tidb-operator deploy the pd cluster dynamically? In other words, does it start with a single node representing a seed cluster and then create new PD nodes to join the seed cluster?
  2. If it is a dynamic deployment, how does tidb-operator identify the first node that starts? For the etcd-operator, the startup command parameters of the first node contain --initial-cluster-state=new, indicating it is a seed member. Subsequent nodes have --initial-cluster-state=existing in their command parameters, indicating they are to join a seed cluster.
  3. Why use dynamic deployment? I think dynamic deployment has the following disadvantages:

i. Dynamic deployment expands the number of cluster nodes from 1 to n through membership changes. While the cluster has fewer than 3 nodes, it cannot tolerate the failure of any node, so it does not meet the node-count requirement of the consensus protocol.

ii. Dynamic deployment requires waiting for a consensus to be reached with each added node, which is slower than static deployment.

For static deployment, it is often necessary to determine the network topology between nodes in advance. We can solve this problem with DNS names. Suppose the three PD nodes I want to start are known in advance to have the DNS names demo-pd-0.demo-pd-peer.demo.svc:2380, demo-pd-1.demo-pd-peer.demo.svc:2380, and demo-pd-2.demo-pd-peer.demo.svc:2380. The startup command for the three pods can then be unified as:

    /pd-server ... --initial-cluster demo-pd-0=http://demo-pd-0.demo-pd-peer.demo.svc:2380,demo-pd-1=http://demo-pd-1.demo-pd-peer.demo.svc:2380,demo-pd-2=http://demo-pd-2.demo-pd-peer.demo.svc:2380 ...

This way, there is no need to rely on a consensus algorithm.

  4. How is the file /var/lib/pd/join generated? Where does the content of the file come from?
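
For reference, the static bootstrap described in question 3 could be generated roughly as follows. This is only a sketch under the naming scheme used in this post (<cluster>-pd-<ordinal>.<cluster>-pd-peer.<namespace>.svc:2380). The function initialCluster and its parameters are made up for illustration and do not exist in tidb-operator.

```go
package main

import (
	"fmt"
	"strings"
)

// initialCluster builds a static --initial-cluster value for a PD StatefulSet
// whose pods are named <cluster>-pd-0 .. <cluster>-pd-(replicas-1).
// Hypothetical helper; the DNS layout is an assumption taken from the examples in this post.
func initialCluster(cluster, namespace string, replicas int) string {
	members := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		name := fmt.Sprintf("%s-pd-%d", cluster, i)
		members = append(members,
			fmt.Sprintf("%s=http://%s.%s-pd-peer.%s.svc:2380", name, name, cluster, namespace))
	}
	return strings.Join(members, ",")
}

func main() {
	// Every pod would receive the same value, regardless of start order.
	fmt.Println("--initial-cluster=" + initialCluster("demo", "demo", 3))
}
```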

Phoenix500526, Nov 08 '23 07:11

TiDB Operator relies on an internal component named "discovery" to dynamically bootstrap the PD cluster. https://github.com/pingcap/tidb-operator/tree/master/cmd/discovery

The join file is generated by PD itself; see https://github.com/tikv/pd/blob/2c07c241114fe9afabd9927ecbee61c4252f2d8e/server/join/join.go#L93
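
To illustrate how a discovery component can bootstrap a cluster dynamically without the script knowing which pod is first, here is a toy Go sketch. It is one plausible scheme and not a copy of cmd/discovery: the type discovery, its fields, and the helper encodePeer are all hypothetical, and the in-memory member list stands in for asking a running PD for its members. Each pod sends its base64-encoded peer address (as the start script does), retries until a seed exists, and then receives either --initial-cluster or --join.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
	"sync"
)

// discovery is a toy stand-in for a discovery service (hypothetical type).
type discovery struct {
	mu       sync.Mutex
	replicas int
	waiting  map[string]struct{} // pods that asked for start args and are not started yet
	members  []string            // peer URLs already handed out (stand-in for querying PD)
}

// encodePeer mirrors `echo ${domain}:2380 | base64` from the start script,
// including the trailing newline that echo adds.
func encodePeer(peer string) string {
	return base64.StdEncoding.EncodeToString([]byte(peer + "\n"))
}

// discover returns the start argument for the calling pod, or an error meaning "retry later".
func (d *discovery) discover(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	peer := strings.TrimSpace(string(raw)) // "<pod>.<peer-svc>.<ns>.svc:2380"
	pod, _, _ := strings.Cut(peer, ".")

	d.mu.Lock()
	defer d.mu.Unlock()
	d.waiting[pod] = struct{}{}

	// Once every expected pod is waiting, the current caller becomes the seed member.
	if len(d.members) == 0 && len(d.waiting) == d.replicas {
		delete(d.waiting, pod)
		d.members = append(d.members, "http://"+peer)
		return fmt.Sprintf("--initial-cluster=%s=http://%s", pod, peer), nil
	}
	if len(d.members) == 0 {
		return "", errors.New("no seed member yet, retry later")
	}
	// A seed exists: join the members known so far.
	delete(d.waiting, pod)
	args := "--join=" + strings.Join(d.members, ",")
	d.members = append(d.members, "http://"+peer)
	return args, nil
}

func main() {
	d := &discovery{replicas: 3, waiting: map[string]struct{}{}}
	// Pods 0 and 1 ask first and must retry; pod 2 completes the set and becomes the seed.
	for _, i := range []int{0, 1, 2, 0, 1} {
		peer := fmt.Sprintf("demo-pd-%d.demo-pd-peer.demo.svc:2380", i)
		args, err := d.discover(encodePeer(peer))
		fmt.Printf("demo-pd-%d: args=%q err=%v\n", i, args, err)
	}
}
```

In this sketch the seed is simply whichever pod happens to complete the waiting set, so no pod needs a special flag in its template; that is the main difference from the --initial-cluster-state approach mentioned in the question.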

csuzhangxc, Nov 09 '23 06:11

It seems like PD itself reads the join file and acts on it. Why would the start script also read it and add the --join args?

cjc7373, May 14 '24 06:05

> It seems like PD itself reads the join file and acts on it. Why would the start script also read it and add the --join args?

PD does not read the file unless the join argument is set; see https://github.com/tikv/pd/blob/2c07c241114fe9afabd9927ecbee61c4252f2d8e/server/join/join.go#L85-L87

csuzhangxc, May 14 '24 07:05