How does the tidb-operator deploy a pd cluster?
Question
Hi everyone, I've been studying the tidb-operator code recently and I've encountered some issues. Even after reading related blogs, I'm still confused.
When deploying a PD cluster, pdMemberManager adds the volumes and volume mounts to the corresponding StatefulSet template as follows:
func (pmm *pdMemberManager) getNewPDSetForTidbCluster(tc *v1alpha1.TidbCluster) (*apps.StatefulSet, error) {
    ...
    volMounts := []corev1.VolumeMount{
        annMount,
        {Name: "config", ReadOnly: true, MountPath: "/etc/pd"},
        {Name: "startup-script", ReadOnly: true, MountPath: "/usr/local/bin"},
        {Name: v1alpha1.PDMemberType.String(), MountPath: "/var/lib/pd"},
    }
    vols := []corev1.Volume{
        annVolume,
        {Name: "config",
            VolumeSource: corev1.VolumeSource{
                ConfigMap: &corev1.ConfigMapVolumeSource{
                    LocalObjectReference: corev1.LocalObjectReference{
                        Name: pdConfigMap,
                    },
                    Items: []corev1.KeyToPath{{Key: "config-file", Path: "pd.toml"}},
                },
            },
        },
        {Name: "startup-script",
            VolumeSource: corev1.VolumeSource{
                ConfigMap: &corev1.ConfigMapVolumeSource{
                    LocalObjectReference: corev1.LocalObjectReference{
                        Name: pdConfigMap,
                    },
                    Items: []corev1.KeyToPath{{Key: "startup-script", Path: "pd_start_script.sh"}},
                },
            },
        },
    }
    ...
    pdSet := &apps.StatefulSet{
        ...
        Spec: apps.StatefulSetSpec{
            Replicas: func() *int32 { r := tc.Spec.PD.Replicas + int32(failureReplicas); return &r }(),
            Selector: pdLabel.LabelSelector(),
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels:      pdLabel.Labels(),
                    Annotations: podAnnotations,
                },
                Spec: corev1.PodSpec{
                    SchedulerName: tc.Spec.SchedulerName,
                    Affinity:      tc.Spec.PD.Affinity,
                    NodeSelector:  tc.Spec.PD.NodeSelector,
                    Containers: []corev1.Container{
                        {
                            ...
                            Image:           tc.Spec.PD.Image,
                            Command:         []string{"/bin/sh", "/usr/local/bin/pd_start_script.sh"},
                            ImagePullPolicy: tc.Spec.PD.ImagePullPolicy,
                            Ports: []corev1.ContainerPort{
                                {
                                    Name:          "server",
                                    ContainerPort: int32(2380),
                                    Protocol:      corev1.ProtocolTCP,
                                },
                                {
                                    Name:          "client",
                                    ContainerPort: int32(2379),
                                    Protocol:      corev1.ProtocolTCP,
                                },
                            },
                            VolumeMounts: volMounts,
                            ...
                        }
                        ...
                    }
                ...
            }
        ...
    }
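For context, annMount and annVolume (elided in the snippet above) expose the pod's own annotations at /etc/podinfo/annotations via the Kubernetes Downward API; the start script below sources that file to read runmode. A minimal sketch of such a volume/mount pair, with illustrative names rather than the operator's exact definitions:

package example

import (
    corev1 "k8s.io/api/core/v1"
)

// podInfoVolume returns an illustrative Downward API volume and mount that
// expose the pod's annotations as the file /etc/podinfo/annotations.
func podInfoVolume() (corev1.Volume, corev1.VolumeMount) {
    vol := corev1.Volume{
        Name: "annotations",
        VolumeSource: corev1.VolumeSource{
            DownwardAPI: &corev1.DownwardAPIVolumeSource{
                Items: []corev1.DownwardAPIVolumeFile{{
                    Path:     "annotations",
                    FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.annotations"},
                }},
            },
        },
    }
    mount := corev1.VolumeMount{
        Name:      "annotations",
        ReadOnly:  true,
        MountPath: "/etc/podinfo",
    }
    return vol, mount
}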
When a PD pod starts, it first executes /usr/local/bin/pd_start_script.sh, as follows:
#!/bin/sh

# This script is used to start pd containers in kubernetes cluster
# Use DownwardAPIVolumeFiles to store informations of the cluster:
# https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#the-downward-api
#
# runmode="normal/debug"
#
set -uo pipefail

ANNOTATIONS="/etc/podinfo/annotations"
if [[ ! -f "${ANNOTATIONS}" ]]
then
    echo "${ANNOTATIONS} does't exist, exiting."
    exit 1
fi
source ${ANNOTATIONS} 2>/dev/null

runmode=${runmode:-normal}
if [[ X${runmode} == Xdebug ]]
then
    echo "entering debug mode."
    tail -f /dev/null
fi

# the general form of variable PEER_SERVICE_NAME is: "<clusterName>-pd-peer"
cluster_name=`echo ${PEER_SERVICE_NAME} | sed 's/-pd-peer//'`
domain="${HOSTNAME}.${PEER_SERVICE_NAME}.${NAMESPACE}.svc"
discovery_url="${cluster_name}-discovery.${NAMESPACE}.svc:10261"
encoded_domain_url=`echo ${domain}:2380 | base64 | tr "\n" " " | sed "s/ //g"`

elapseTime=0
period=1
threshold=30
while true; do
    sleep ${period}
    elapseTime=$(( elapseTime+period ))
    if [[ ${elapseTime} -ge ${threshold} ]]
    then
        echo "waiting for pd cluster ready timeout" >&2
        exit 1
    fi
    if nslookup ${domain} 2>/dev/null
    then
        echo "nslookup domain ${domain}.svc success"
        break
    else
        echo "nslookup domain ${domain} failed" >&2
    fi
done

# The content of /etc/pd/pd.toml is as follows:
# [log]
# level = "info"
# [replication]
# location-labels = ["region", "zone", "rack", "host"]
ARGS="--data-dir=/var/lib/pd \
--name=${HOSTNAME} \
--peer-urls=http://0.0.0.0:2380 \
--advertise-peer-urls=http://${domain}:2380 \
--client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://${domain}:2379 \
--config=/etc/pd/pd.toml \
"

if [[ -f /var/lib/pd/join ]]
then
    # The content of the join file is:
    #   demo-pd-0=http://demo-pd-0.demo-pd-peer.demo.svc:2380,demo-pd-1=http://demo-pd-1.demo-pd-peer.demo.svc:2380
    # The --join args must be:
    #   --join=http://demo-pd-0.demo-pd-peer.demo.svc:2380,http://demo-pd-1.demo-pd-peer.demo.svc:2380
    join=`cat /var/lib/pd/join | tr "," "\n" | awk -F'=' '{print $2}' | tr "\n" ","`
    join=${join%,}
    ARGS="${ARGS} --join=${join}"
elif [[ ! -d /var/lib/pd/member/wal ]]
then
    until result=$(wget -qO- -T 3 http://${discovery_url}/new/${encoded_domain_url} 2>/dev/null); do
        echo "waiting for discovery service to return start args ..."
        sleep $((RANDOM % 5))
    done
    ARGS="${ARGS}${result}"
fi

echo "starting pd-server ..."
sleep $((RANDOM % 10))
echo "/pd-server ${ARGS}"
exec /pd-server ${ARGS}
My questions are as follows:
- Does tidb-operator deploy the PD cluster dynamically? In other words, does it start with a single node acting as the seed cluster and then create new PD nodes that join it?
- If it is a dynamic deployment, how does tidb-operator identify the first node to start? With etcd-operator, the startup command of the first node contains --initial-cluster-state=new, marking it as the seed member, while subsequent nodes are started with --initial-cluster-state=existing, marking them as members that join the seed cluster.
- Why use dynamic deployment at all? I think dynamic deployment has the following disadvantages:
  i. Dynamic deployment grows the cluster from 1 to n nodes through membership changes; while n is less than 3, the consensus protocol does not have the required number of nodes.
  ii. Dynamic deployment has to wait for consensus to be reached on each added member, which is slower than static deployment.
  Static deployment usually requires knowing the network topology of the nodes in advance, but this can be solved with DNS names. Suppose the three PD nodes I want to start are known in advance to have the DNS names demo-pd-0.demo-pd-peer.demo.svc:2380, demo-pd-1.demo-pd-peer.demo.svc:2380, and demo-pd-2.demo-pd-peer.demo.svc:2380. Then the startup command for all three pods can be identical: /pd-server ... --initial-cluster demo-pd-0=http://demo-pd-0.demo-pd-peer.demo.svc:2380,demo-pd-1=http://demo-pd-1.demo-pd-peer.demo.svc:2380,demo-pd-2=http://demo-pd-2.demo-pd-peer.demo.svc:2380 ..., with no need to rely on a consensus algorithm for membership changes.
- How is the file /var/lib/pd/join generated, and where does its content come from?
TiDB Operator relies on an internal component named "discovery" to dynamically bootstrap the PD cluster: https://github.com/pingcap/tidb-operator/tree/master/cmd/discovery. The join file is generated by PD itself; see https://github.com/tikv/pd/blob/2c07c241114fe9afabd9927ecbee61c4252f2d8e/server/join/join.go#L93
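To make the bootstrap flow concrete, here is a simplified sketch of the idea behind the discovery service (illustrative only, not the real pkg/discovery code; names and details are assumptions). Each PD pod asks GET /new/<base64(advertise-peer-url)> in a retry loop, just as the start script above does. While no PD cluster exists yet, the service waits until every expected replica has registered and then hands exactly one pod an --initial-cluster argument pointing only at itself, so that pod bootstraps a single-member cluster; every other pod keeps retrying and, once the seed PD is up, receives a --join argument instead:

package main

import (
    "fmt"
    "strings"
)

// discovery is a toy stand-in for the real service; the real one reads the
// TidbCluster object and asks PD's members API instead of keeping a slice.
type discovery struct {
    replicas int                 // expected PD replicas (tc.Spec.PD.Replicas)
    peers    map[string]struct{} // pods that have requested start args so far
    members  []string            // addresses of PD members that are already running
}

// discover returns the extra pd-server arguments for one pod, or an error,
// in which case the pod's start script simply retries later.
func (d *discovery) discover(podName, peerURL string) (string, error) {
    d.peers[podName] = struct{}{}

    if len(d.members) == 0 {
        // No PD cluster yet: once all expected replicas have registered,
        // tell exactly one of them to bootstrap a single-member cluster.
        if len(d.peers) == d.replicas {
            delete(d.peers, podName)
            return fmt.Sprintf("--initial-cluster=%s=%s", podName, peerURL), nil
        }
        return "", fmt.Errorf("waiting: %d/%d replicas registered", len(d.peers), d.replicas)
    }

    // A cluster already exists: every other pod joins it dynamically.
    delete(d.peers, podName)
    return fmt.Sprintf("--join=%s", strings.Join(d.members, ",")), nil
}

func main() {
    d := &discovery{replicas: 3, peers: map[string]struct{}{}}

    // demo-pd-0 and demo-pd-1 register first and are told to wait.
    fmt.Println(d.discover("demo-pd-0", "http://demo-pd-0.demo-pd-peer.demo.svc:2380"))
    fmt.Println(d.discover("demo-pd-1", "http://demo-pd-1.demo-pd-peer.demo.svc:2380"))

    // demo-pd-2 completes the set and bootstraps the seed cluster.
    fmt.Println(d.discover("demo-pd-2", "http://demo-pd-2.demo-pd-peer.demo.svc:2380"))

    // Once the seed PD is up, the others retry and are told to join it.
    d.members = []string{"http://demo-pd-2.demo-pd-peer.demo.svc:2380"}
    fmt.Println(d.discover("demo-pd-0", "http://demo-pd-0.demo-pd-peer.demo.svc:2380"))
    fmt.Println(d.discover("demo-pd-1", "http://demo-pd-1.demo-pd-peer.demo.svc:2380"))
}

If this sketch matches the real service, the answers to the first two questions are: yes, the cluster grows dynamically from a single seed member, and it is the discovery service (not the start script or a fixed pod ordinal) that decides which pod bootstraps it.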
It seems like PD itself reads the join file and acts on it, so why does the start script also read it and add the --join argument?
PD wouldn't read the join file if no --join argument is set; see https://github.com/tikv/pd/blob/2c07c241114fe9afabd9927ecbee61c4252f2d8e/server/join/join.go#L85-L87
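For intuition, the guard at that reference can be paraphrased roughly as follows (a hedged sketch, not a quote of PD's actual join.go; the type and field names here are illustrative):

package example

import "fmt"

// Config stands in for PD's server configuration (illustrative).
type Config struct {
    Join    string // value of the --join flag, empty when not set
    DataDir string // e.g. /var/lib/pd
}

// prepareJoin paraphrases the early return: if --join was not passed,
// PD never looks at <data-dir>/join at all.
func prepareJoin(cfg *Config) error {
    if cfg.Join == "" {
        return nil // no --join flag: the join file is ignored
    }
    // Only when --join is set does PD consult <data-dir>/join (written after a
    // previous successful join) and the target cluster's member list.
    fmt.Printf("would read %s/join and contact %s\n", cfg.DataDir, cfg.Join)
    return nil
}

In other words, the start script re-derives --join from the join file on restart precisely because PD ignores that file when the flag is absent, which is what the if [[ -f /var/lib/pd/join ]] branch above does.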