Changing system time during pod creation causes init containers to run in incorrect order
Issue Description
When creating init containers in a pod using podman kube play, they are created in the order specified in the YAML spec. When they are run, they are run in order of creation time (oldest first, newest last). However, in rare circumstances, these two orders are not the same. If the system time changes during pod creation, for example when a system service pod starts up concurrently with the system time daemon, the system time can change while init containers are being created. If the system time is set back into the past, the later init containers can end up with creation times earlier than those of the earlier init containers, causing them to run first.
This is a serious problem if the init containers must run in the specified order to behave correctly.
- Here is where containers get their creation time set: https://github.com/containers/podman/blob/afe55cded062ab0c56f57e99002686862f3327c9/libpod/runtime_ctr.go#L212
- Here is where containers are retrieved from podman state in creation time order: https://github.com/containers/podman/blob/67bbbb9e94a00a8b5d1d358dfcc8bbd1bd0c9b55/libpod/pod.go#L502-L518
- Here is where containers are started in retrieval order: https://github.com/containers/podman/blob/afe55cded062ab0c56f57e99002686862f3327c9/libpod/pod_api.go#L20-L27
Steps to reproduce the issue
Send system time backwards while concurrently creating a pod with multiple init containers.
Describe the results you received
Here is an example:
- podman kube play a pod with 4 init containers (A, B, C, D) at 3:00pm
- init container A is created at 3:00:01pm
- init container B is created at 3:00:02pm
- ntpd adjusts system time to 2:00pm
- init container C is created at 2:00:01pm
- init container D is created at 2:00:02pm
- libpod fetches the init containers, ordered by creation time
- libpod runs init container C (created at 2:00:01pm)
- libpod runs init container D (created at 2:00:02pm)
- libpod runs init container A (created at 3:00:01pm)
- libpod runs init container B (created at 3:00:02pm)
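The timeline above can be reproduced in isolation with a minimal Go sketch. This is not libpod code: the `initCtr` type, the `startOrder` and `exampleTimeline` helpers, and the calendar date are assumptions made for illustration. It only shows that sorting by a wall-clock creation timestamp yields the wrong order once the clock has been stepped back.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// initCtr is a hypothetical stand-in for a container record; it is not a libpod type.
type initCtr struct {
	Name    string
	Created time.Time
}

// startOrder sorts a copy of ctrs by creation time (oldest first) and
// returns the resulting names joined by commas.
func startOrder(ctrs []initCtr) string {
	sorted := append([]initCtr(nil), ctrs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Created.Before(sorted[j].Created)
	})
	names := make([]string, 0, len(sorted))
	for _, c := range sorted {
		names = append(names, c.Name)
	}
	return strings.Join(names, ",")
}

// exampleTimeline lists the containers in spec order with the wall-clock
// timestamps from the example. The date itself is arbitrary.
func exampleTimeline() []initCtr {
	at := func(h, m, s int) time.Time {
		return time.Date(2024, time.April, 1, h, m, s, 0, time.UTC)
	}
	return []initCtr{
		{"A", at(15, 0, 1)},
		{"B", at(15, 0, 2)},
		// ntpd steps the clock back to 2:00pm at this point.
		{"C", at(14, 0, 1)},
		{"D", at(14, 0, 2)},
	}
}

func main() {
	fmt.Println(startOrder(exampleTimeline())) // prints "C,D,A,B"
}
```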
Describe the results you expected
Init containers should run in the specified order: A, then B, then C, then D.
podman info output
[root@nms70 ~]# podman info
host:
  arch: amd64
  buildahVersion: 1.33.7
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.10-1.module+el8.10.0+21077+98b84d8a.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: 80c4f656297773fb630a4d966add3242abab39a4'
  cpuUtilization:
    idlePercent: 87.09
    systemPercent: 5.32
    userPercent: 7.59
  cpus: 2
  databaseBackend: sqlite
  distribution:
    distribution: rhel
    version: "8.10"
  eventLogger: file
  freeLocks: 2000
  hostname: nms70
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-553.el8_10.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 1397669888
  memTotal: 8071610368
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns:
      package: podman-plugins-4.9.4-1.module+el8.10.0+21632+761e0d34.x86_64
      path: /usr/libexec/cni/dnsname
      version: |-
        CNI dnsname plugin
        version: 1.4.0-dev
        commit: unknown
        CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
    package: containernetworking-plugins-1.4.0-2.module+el8.10.0+21366+f9cb49f8.x86_64
    path: /usr/libexec/cni
  ociRuntime:
    name: runc
    package: runc-1.1.12-1.module+el8.10.0+21251+62b7388c.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.12
      spec: 1.0.2-dev
      go: go1.21.3
      libseccomp: 2.5.2
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.3-1.module+el8.10.0+21306+6be40ce7.x86_64
    version: |-
      slirp4netns version 1.2.3
      commit: c22fde291bb35b354e6ca44d13be181c76a0a432
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 4294963200
  swapTotal: 4294963200
  uptime: 4h 7m 47.00s (Approximately 0.17 days)
  variant: ""
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 25
    paused: 0
    running: 25
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 161949396992
  graphRootUsed: 20610424832
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 25
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.9.4-rhel
  Built: 1711986940
  BuiltTime: Mon Apr 1 15:55:40 2024
  GitCommit: ""
  GoVersion: go1.21.7 (Red Hat 1.21.7-1.module+el8.10.0+21318+5ea197f8)
  Os: linux
  OsArch: linux/amd64
  Version: 4.9.4-rhel
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
No
Additional environment details
This happened in a vCenter virtualization environment where VMs boot with a system time that is slightly too far in the future, and ntpd corrects the time during pod creation. It does not happen on every boot, only on some. When it does happen, a critical service running in a podman pod does not start correctly because critical functions performed by its init containers are performed out of order.
Additional information
No response
@baude, should we store each init container's position in the spec definition and use that to sort the init containers? Otherwise, I don't see any immediate way to do it with what we currently have.
Yes, I think the only way is to store the order explicitly. Maybe we can hack it into the annotations, or we add a new field to the container config.
That said, I think we use the creation time for many more things: for one, display in podman ps/inspect, but also to calculate average CPU usage for things like stats. So quite a few things can go wrong if the time is changed.