How does the tidb-operator deploy a pd cluster?
Question
Hi everyone, I've been studying the tidb-operator code recently and I've encountered some issues. Even after reading related blogs, I'm still confused.
When deploying a PD cluster, pdMemberManager adds the volumes and volume mounts to the corresponding StatefulSet template as follows:
func (pmm *pdMemberManager) getNewPDSetForTidbCluster(tc *v1alpha1.TidbCluster) (*apps.StatefulSet, error) {
    ...
    volMounts := []corev1.VolumeMount{
        annMount,
        {Name: "config", ReadOnly: true, MountPath: "/etc/pd"},
        {Name: "startup-script", ReadOnly: true, MountPath: "/usr/local/bin"},
        {Name: v1alpha1.PDMemberType.String(), MountPath: "/var/lib/pd"},
    }
    vols := []corev1.Volume{
        annVolume,
        {Name: "config",
            VolumeSource: corev1.VolumeSource{
                ConfigMap: &corev1.ConfigMapVolumeSource{
                    LocalObjectReference: corev1.LocalObjectReference{
                        Name: pdConfigMap,
                    },
                    Items: []corev1.KeyToPath{{Key: "config-file", Path: "pd.toml"}},
                },
            },
        },
        {Name: "startup-script",
            VolumeSource: corev1.VolumeSource{
                ConfigMap: &corev1.ConfigMapVolumeSource{
                    LocalObjectReference: corev1.LocalObjectReference{
                        Name: pdConfigMap,
                    },
                    Items: []corev1.KeyToPath{{Key: "startup-script", Path: "pd_start_script.sh"}},
                },
            },
        },
    }
    ...
    pdSet := &apps.StatefulSet{
        ...
        Spec: apps.StatefulSetSpec{
            Replicas: func() *int32 { r := tc.Spec.PD.Replicas + int32(failureReplicas); return &r }(),
            Selector: pdLabel.LabelSelector(),
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels:      pdLabel.Labels(),
                    Annotations: podAnnotations,
                },
                Spec: corev1.PodSpec{
                    SchedulerName: tc.Spec.SchedulerName,
                    Affinity:      tc.Spec.PD.Affinity,
                    NodeSelector:  tc.Spec.PD.NodeSelector,
                    Containers: []corev1.Container{
                        {
                            ...
                            Image:           tc.Spec.PD.Image,
                            Command:         []string{"/bin/sh", "/usr/local/bin/pd_start_script.sh"},
                            ImagePullPolicy: tc.Spec.PD.ImagePullPolicy,
                            Ports: []corev1.ContainerPort{
                                {
                                    Name:          "server",
                                    ContainerPort: int32(2380),
                                    Protocol:      corev1.ProtocolTCP,
                                },
                                {
                                    Name:          "client",
                                    ContainerPort: int32(2379),
                                    Protocol:      corev1.ProtocolTCP,
                                },
                            },
                            VolumeMounts: volMounts,
                            ...
                        }
                        ...
                    }
                ...
            }
        ...
    }
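For context, annMount and annVolume (elided in the snippet above) expose the pod's own annotations at /etc/podinfo/annotations via the Kubernetes Downward API; the start script below sources that file to read runmode. A minimal sketch of such a volume/mount pair, with illustrative names rather than the operator's exact definitions:

package example

import (
    corev1 "k8s.io/api/core/v1"
)

// podInfoVolume returns an illustrative Downward API volume and mount that
// expose the pod's annotations as the file /etc/podinfo/annotations.
func podInfoVolume() (corev1.Volume, corev1.VolumeMount) {
    vol := corev1.Volume{
        Name: "annotations",
        VolumeSource: corev1.VolumeSource{
            DownwardAPI: &corev1.DownwardAPIVolumeSource{
                Items: []corev1.DownwardAPIVolumeFile{{
                    Path:     "annotations",
                    FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.annotations"},
                }},
            },
        },
    }
    mount := corev1.VolumeMount{
        Name:      "annotations",
        ReadOnly:  true,
        MountPath: "/etc/podinfo",
    }
    return vol, mount
}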
When a PD pod starts, it first executes /usr/local/bin/pd_start_script.sh, as follows:
#!/bin/sh

# This script is used to start pd containers in kubernetes cluster
# Use DownwardAPIVolumeFiles to store informations of the cluster:
# https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#the-downward-api
#
# runmode="normal/debug"
#
set -uo pipefail

ANNOTATIONS="/etc/podinfo/annotations"
if [[ ! -f "${ANNOTATIONS}" ]]
then
    echo "${ANNOTATIONS} does't exist, exiting."
    exit 1
fi
source ${ANNOTATIONS} 2>/dev/null

runmode=${runmode:-normal}
if [[ X${runmode} == Xdebug ]]
then
    echo "entering debug mode."
    tail -f /dev/null
fi

# the general form of variable PEER_SERVICE_NAME is: "<clusterName>-pd-peer"
cluster_name=`echo ${PEER_SERVICE_NAME} | sed 's/-pd-peer//'`
domain="${HOSTNAME}.${PEER_SERVICE_NAME}.${NAMESPACE}.svc"
discovery_url="${cluster_name}-discovery.${NAMESPACE}.svc:10261"
encoded_domain_url=`echo ${domain}:2380 | base64 | tr "\n" " " | sed "s/ //g"`

elapseTime=0
period=1
threshold=30
while true; do
    sleep ${period}
    elapseTime=$(( elapseTime+period ))
    if [[ ${elapseTime} -ge ${threshold} ]]
    then
        echo "waiting for pd cluster ready timeout" >&2
        exit 1
    fi
    if nslookup ${domain} 2>/dev/null
    then
        echo "nslookup domain ${domain}.svc success"
        break
    else
        echo "nslookup domain ${domain} failed" >&2
    fi
done

# The content of /etc/pd/pd.toml is as follows:
# [log]
# level = "info"
# [replication]
# location-labels = ["region", "zone", "rack", "host"]
ARGS="--data-dir=/var/lib/pd \
--name=${HOSTNAME} \
--peer-urls=http://0.0.0.0:2380 \
--advertise-peer-urls=http://${domain}:2380 \
--client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://${domain}:2379 \
--config=/etc/pd/pd.toml \
"

if [[ -f /var/lib/pd/join ]]
then
    # The content of the join file is:
    #   demo-pd-0=http://demo-pd-0.demo-pd-peer.demo.svc:2380,demo-pd-1=http://demo-pd-1.demo-pd-peer.demo.svc:2380
    # The --join args must be:
    #   --join=http://demo-pd-0.demo-pd-peer.demo.svc:2380,http://demo-pd-1.demo-pd-peer.demo.svc:2380
    join=`cat /var/lib/pd/join | tr "," "\n" | awk -F'=' '{print $2}' | tr "\n" ","`
    join=${join%,}
    ARGS="${ARGS} --join=${join}"
elif [[ ! -d /var/lib/pd/member/wal ]]
then
    until result=$(wget -qO- -T 3 http://${discovery_url}/new/${encoded_domain_url} 2>/dev/null); do
        echo "waiting for discovery service to return start args ..."
        sleep $((RANDOM % 5))
    done
    ARGS="${ARGS}${result}"
fi

echo "starting pd-server ..."
sleep $((RANDOM % 10))
echo "/pd-server ${ARGS}"
exec /pd-server ${ARGS}
My questions are as follows:
- Does tidb-operator deploy the PD cluster dynamically? In other words, does it start with a single node acting as the seed cluster and then create new PD nodes that join it?
- If it is a dynamic deployment, how does tidb-operator identify the first node to start? With etcd-operator, the startup command of the first node contains --initial-cluster-state=new, marking it as the seed member, while subsequent nodes are started with --initial-cluster-state=existing, marking them as members that join the seed cluster.
- Why use dynamic deployment at all? I think dynamic deployment has the following disadvantages:
  i. Dynamic deployment grows the cluster from 1 to n nodes through membership changes; while n is less than 3, the consensus protocol does not have the required number of nodes.
  ii. Dynamic deployment has to wait for consensus to be reached on each added member, which is slower than static deployment.
  Static deployment usually requires knowing the network topology of the nodes in advance, but this can be solved with DNS names. Suppose the three PD nodes I want to start are known in advance to have the DNS names demo-pd-0.demo-pd-peer.demo.svc:2380, demo-pd-1.demo-pd-peer.demo.svc:2380, and demo-pd-2.demo-pd-peer.demo.svc:2380. Then the startup command for all three pods can be identical: /pd-server ... --initial-cluster demo-pd-0=http://demo-pd-0.demo-pd-peer.demo.svc:2380,demo-pd-1=http://demo-pd-1.demo-pd-peer.demo.svc:2380,demo-pd-2=http://demo-pd-2.demo-pd-peer.demo.svc:2380 ..., with no need to rely on a consensus algorithm for membership changes.
- How is the file /var/lib/pd/join generated, and where does its content come from?
TiDB Operator relies on an internal component named "discovery" to dynamically bootstrap the PD cluster: https://github.com/pingcap/tidb-operator/tree/master/cmd/discovery. The join file is generated by PD itself; see https://github.com/tikv/pd/blob/2c07c241114fe9afabd9927ecbee61c4252f2d8e/server/join/join.go#L93
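To make the bootstrap flow concrete, here is a simplified sketch of the idea behind the discovery service (illustrative only, not the real pkg/discovery code; names and details are assumptions). Each PD pod asks GET /new/<base64(advertise-peer-url)> in a retry loop, just as the start script above does. While no PD cluster exists yet, the service waits until every expected replica has registered and then hands exactly one pod an --initial-cluster argument pointing only at itself, so that pod bootstraps a single-member cluster; every other pod keeps retrying and, once the seed PD is up, receives a --join argument instead:

package main

import (
    "fmt"
    "strings"
)

// discovery is a toy stand-in for the real service; the real one reads the
// TidbCluster object and asks PD's members API instead of keeping a slice.
type discovery struct {
    replicas int                 // expected PD replicas (tc.Spec.PD.Replicas)
    peers    map[string]struct{} // pods that have requested start args so far
    members  []string            // addresses of PD members that are already running
}

// discover returns the extra pd-server arguments for one pod, or an error,
// in which case the pod's start script simply retries later.
func (d *discovery) discover(podName, peerURL string) (string, error) {
    d.peers[podName] = struct{}{}

    if len(d.members) == 0 {
        // No PD cluster yet: once all expected replicas have registered,
        // tell exactly one of them to bootstrap a single-member cluster.
        if len(d.peers) == d.replicas {
            delete(d.peers, podName)
            return fmt.Sprintf("--initial-cluster=%s=%s", podName, peerURL), nil
        }
        return "", fmt.Errorf("waiting: %d/%d replicas registered", len(d.peers), d.replicas)
    }

    // A cluster already exists: every other pod joins it dynamically.
    delete(d.peers, podName)
    return fmt.Sprintf("--join=%s", strings.Join(d.members, ",")), nil
}

func main() {
    d := &discovery{replicas: 3, peers: map[string]struct{}{}}

    // demo-pd-0 and demo-pd-1 register first and are told to wait.
    fmt.Println(d.discover("demo-pd-0", "http://demo-pd-0.demo-pd-peer.demo.svc:2380"))
    fmt.Println(d.discover("demo-pd-1", "http://demo-pd-1.demo-pd-peer.demo.svc:2380"))

    // demo-pd-2 completes the set and bootstraps the seed cluster.
    fmt.Println(d.discover("demo-pd-2", "http://demo-pd-2.demo-pd-peer.demo.svc:2380"))

    // Once the seed PD is up, the others retry and are told to join it.
    d.members = []string{"http://demo-pd-2.demo-pd-peer.demo.svc:2380"}
    fmt.Println(d.discover("demo-pd-0", "http://demo-pd-0.demo-pd-peer.demo.svc:2380"))
    fmt.Println(d.discover("demo-pd-1", "http://demo-pd-1.demo-pd-peer.demo.svc:2380"))
}

If this sketch matches the real service, the answers to the first two questions are: yes, the cluster grows dynamically from a single seed member, and it is the discovery service (not the start script or a fixed pod ordinal) that decides which pod bootstraps it.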
It seems like PD itself reads the join file and acts on it, so why does the start script also read it and add the --join argument?
PD wouldn't read the join file if no --join argument is set; see https://github.com/tikv/pd/blob/2c07c241114fe9afabd9927ecbee61c4252f2d8e/server/join/join.go#L85-L87
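For intuition, the guard at that reference can be paraphrased roughly as follows (a hedged sketch, not a quote of PD's actual join.go; the type and field names here are illustrative):

package example

import "fmt"

// Config stands in for PD's server configuration (illustrative).
type Config struct {
    Join    string // value of the --join flag, empty when not set
    DataDir string // e.g. /var/lib/pd
}

// prepareJoin paraphrases the early return: if --join was not passed,
// PD never looks at <data-dir>/join at all.
func prepareJoin(cfg *Config) error {
    if cfg.Join == "" {
        return nil // no --join flag: the join file is ignored
    }
    // Only when --join is set does PD consult <data-dir>/join (written after a
    // previous successful join) and the target cluster's member list.
    fmt.Printf("would read %s/join and contact %s\n", cfg.DataDir, cfg.Join)
    return nil
}

In other words, the start script re-derives --join from the join file on restart precisely because PD ignores that file when the flag is absent, which is what the if [[ -f /var/lib/pd/join ]] branch above does.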