spegel
Removal of certs in containerd configuration causing 500 error (mirror resolve retries exhausted for key)
Spegel version: v0.0.19
Kubernetes distribution: EKS
Kubernetes version: v1.27.0
CNI: VPC CNI
Describe the bug
We are using JFrog Artifactory as our registry, with TLS certificate authentication for resolving and accessing our registry endpoints. We were facing a 500 error ("mirror resolve retries exhausted for key") while running a pod; it gave this error:
Type     Reason      Age                   From               Message
----     ------      ----                  ----               -------
Normal   Scheduled   29m                   default-scheduler  Successfully assigned large-pod-1 to ip-10-190-18-201.eu-west-1.compute.internal
Normal   Pulling     27m (x4 over 29m)     kubelet            Pulling image "xxx"
Warning  Failed      27m (x4 over 29m)     kubelet            Failed to pull image "xxx": rpc error: code = Unknown desc = failed to pull and unpack image "xxx": failed to resolve reference "xxx": unexpected status from HEAD request to http://10.190.18.201:30020/v2/xxx 500 Internal Server Error
Warning  Failed      27m (x4 over 29m)     kubelet            Error: ErrImagePull
Warning  Failed      27m (x6 over 29m)     kubelet            Error: ImagePullBackOff
Normal   BackOff     4m6s (x109 over 29m)  kubelet            Back-off pulling image "xxx"
On digging deeper we saw that our client.cert and client.key files are not loaded into the configuration:
configuration {"level":"info","ts":1712209973.4201605,"caller":"oci/containerd.go:511","msg":"backing up Containerd host configuration","path":"/etc/containerd/certs.d/xxx"}
configuration {"level":"info","ts":1712209973.420308,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"https://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}
configuration {"level":"info","ts":1712209973.4203854,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}
configuration {"level":"info","ts":1712209973.4204602,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}
configuration {"level":"info","ts":1712209973.4205267,"caller":"build/main.go:87","msg":"gracefully shutdown"}
Only hosts.toml seems to be loaded by spegel, so we are not able to access artifactory and pull.
Sharing the spegel log file for reference. Can you please ensure that all the files in the cert configuration are copied into the spegel configuration?
@IndhumithaR I understand the problem that you are in, as Spegel will override any configuration. I do agree that this should be fixed.
Could you share an example hosts.toml that you are using, which can be used in a unit test?
Hello @phillebaba, thanks for the quick turnaround on this.
Actually we don't have a hosts.toml file. It gets created automatically with hosts on 30020 and 30021 using the local IPv4 of the VM.
hosts.toml
[host."https://<LOCAL_VM_IPV4>:30020"]
capabilities = ["pull", "resolve"]
[host."https://<LOCAL_VM_IPV4>:30021"]
capabilities = ["pull", "resolve"]
But while researching we did find an interesting workaround for the issue above: actually creating one hosts.toml file and adding the client field within the host configuration. But we are facing issues with it. When just adding
client = ["../../client.cert", "../../client.key"]
we get an error -
failed to load X509 key pair: tls: failed to find certificate PEM data in certificate input, but did find a private key; PEM inputs may have been switched
When we add it as
client = [["../../client.cert", "../../client.key"],[""]]
OR
client = [["../../client.cert", "../../client.key"],["",""]]
OR
client = [["../../client.cert"], ["../../client.key"]]
then we get
xxx": dial tcp 10.190.18.201:30020: connect: no route to host
which means it still doesn't get connected.
But it would be far better if the whole mirror configuration got copied the way we add it in the certs.d configuration, instead of just hosts.toml getting created.
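For what it's worth, containerd's hosts.toml appears to treat a flat client array as a list of standalone certificate files, while cert/key pairs need to be nested arrays, which would explain the "PEM inputs may have been switched" error above (the key file gets parsed as a certificate). A sketch with illustrative paths, not a confirmed fix:

```toml
[host."https://<LOCAL_VM_IPV4>:30020"]
capabilities = ["pull", "resolve"]
# Each entry in the outer array is one [cert, key] pair (paths are illustrative).
client = [["/etc/certs/client.cert", "/etc/certs/client.key"]]
```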
Hi @phillebaba, as Badal said, we did a workaround by creating one hosts.toml file and adding the client field within the host configuration.
Here is the sample hosts.toml for your information,
mkdir -p /etc/containerd/certs.d/xxx/
cat > /etc/containerd/certs.d/xxx/hosts.toml << EOL
server = 'https://xxx'
[host]
[host.'http://${ipv4}:30020']
capabilities = ['pull', 'resolve']
client = ["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]
[host.'http://${ipv4}:30021']
capabilities = ['pull', 'resolve']
client = ["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]
EOL
Also, instead of creating a new hosts.toml file, it would be better to copy the existing one if it's already present. Is that currently the case?
Yes I think the solution is to modify existing toml files instead of replacing them.
The current implementation is slightly optimistic when it comes to writing mirror configuration. To avoid any issues I thought it was best to completely replace any configuration. I do see however the use case to append to existing configuration.
A good solution would be to add a new configuration so users can decide whether to keep the existing behavior or use the new one. We can then default it to the current behavior and that way you can enable appending to mirror configuration. Does that sound good?
Yes, I think it should work. We can try this approach.
Yes, that would be great. Also, not just the hosts.toml files but whatever files are in the configuration, like client.key and client.cert, should be added to the configuration and loaded. Also it would be great to add a print statement in the logs for this. The way we currently see
configuration {"level":"info","ts":1712209973.4201605,"caller":"oci/containerd.go:511","msg":"backing up Containerd host configuration","path":"/etc/containerd/certs.d/xxx"}
configuration {"level":"info","ts":1712209973.420308,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"https://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}
configuration {"level":"info","ts":1712209973.4203854,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}
configuration {"level":"info","ts":1712209973.4204602,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}
configuration {"level":"info","ts":1712209973.4205267,"caller":"build/main.go:87","msg":"gracefully shutdown"}
we should have it print "added containerd mirror configuration" with the paths of all the files, so that when spegel starts we know which mirrors and which files are loaded and added to the configuration. Here, if a hosts.toml file is present then overwrite it, but if it's not present, create it.
@badaldavda8 @IndhumithaR
I have created #424 that should resolve this issue.
For clarification, the change adds a new flag --append-mirrors
which, when enabled, will append to existing mirror configurations. Given the existing mirror configuration for docker.io on the node:
server = 'https://registry-1.docker.io'
[host]
[host.'http://example.com:30020']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
[host.'http://example.com:30021']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
It will result in the following configuration after Spegel has run. In this example we are just adding a single mirror located at localhost:5000.
server = 'https://registry-1.docker.io'
[host]
[host.'http://127.0.0.1:5000']
capabilities = ['pull', 'resolve']
[host.'http://example.com:30020']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']
[host.'http://example.com:30021']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']
Does this match the needs you have to solve the problem?
Hi @phillebaba, yes this would solve our problem. It would also be better if we could keep all our files like client.cert and client.key in the config path instead of moving them to the backup path.
It would be great if you could give us an approximate ETA for fixing this issue.
So the PR should be ready to merge now, and I could probably cut a release if this is a blocker for you.
I am however thinking about the best way of keeping the certificates in the configuration directory. The question is how far to take this as we would need to determine which files should be kept. For this to work we would need some method of determining which directories are host configurations and which are not.
Would it be possible for you to place the certificates in a different directory, like /etc/certs/client.pem as they do in the documentation, rather than placing them in /etc/containerd/certs.d?
Hello @phillebaba,
Thanks for adding the flag. The problem we are facing is that we are not sure exactly how to add these certs in the hosts.toml file and make it work. We were just attempting a workaround using hosts.toml.
It would be great if the whole directory were loaded, including /etc/containerd/certs.d/xyz.com/client.cert and /etc/containerd/certs.d/xyz.com/client.key, because this is how we have had it working until now.
We did try to add it via the client field as shared above, even without spegel, and it failed. Do you have any direction for that, since the containerd documentation is not clear on how to add client.key and client.cert for TLS authentication for each mirror?
We are still trying to add multiple configurations in containerd config.toml and hosts.toml. Will keep you updated on the issue as well.
One solution could be to copy over all of the file contents from the existing directories, to allow registry-specific certs. I would rather avoid this if possible. If we were to add this I would probably add an additional flag for this behavior.
I had a look back at your previous examples and saw that you are using client cert pairs for authentication. Note that the Containerd documentation has different syntax for this compared to using a pem file, so it's important that you configure them as key pairs.
client = [["/etc/certs/client.cert", "/etc/certs/client.key"]]
https://github.com/containerd/containerd/blob/main/docs/hosts.md#client-field
In the documentation they use absolute paths for both files, which seems to work fine even when the files are located in a different directory.
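Putting the key-pair syntax together with absolute paths, a hosts.toml for a registry with client cert authentication might look like the following (registry URL and paths are illustrative, not taken from a tested setup):

```toml
server = "https://artifactory-registry-url"

[host."https://artifactory-registry-url"]
capabilities = ["pull", "resolve"]
# Nested array: one [cert, key] pair per entry, with absolute paths.
client = [["/etc/certs/client.cert", "/etc/certs/client.key"]]
```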
In theory you do not have to depend on Spegel to configure mirrors for you. As you are running EKS it is totally possible to just set up all of the mirror configuration in your custom bootstrap.sh file and then disable Spegel mirror configuration.
I will go ahead and merge #424 and cut a release for that. If you need more features to persist other files in the certs.d directory, let me know. I will keep this issue open until things are resolved.
@badaldavda8 @IndhumithaR I have cut v0.0.21 now which contains an option to enable merging. As long as you are not placing certificates in certs.d directory I think this would work for you.
We will test this and get back to you!
Hi @phillebaba,
We tried using v0.0.21 with --append-mirrors set to true.
Even then we noticed that our existing hosts.toml is moved to /etc/containerd/certs.d/_backup and a new hosts.toml is created in /etc/containerd/certs.d/.
Not sure what's wrong.
That is the expected behavior. It is not really possible not to back up all of the mirror configuration. We need the original registries to be able to update the registry configuration at a later date. Without the backup we won't know what the original state is.
You should see that the new host files are basically the original merged with the new configuration for Spegel, which is what we discussed.
The mirror configuration in Spegel is mostly just helper logic for the purpose of simplifying operations for the majority of users. As I suggested before, if your use case does not fit this solution you could disable Spegel mirror configuration and rely on your own method instead.
Just to make sure, the new hosts.toml files are correct, right? With the merged content.
Hi @phillebaba We see that the hosts.toml is getting overwritten. It is not getting merged.
How do you suggest that we disable spegel mirror configuration? Is it by using spegel.containerdMirrorAdd=false?
Yes if you set that to false you will disable the whole init container which writes mirror configuration.
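For reference, assuming the chart exposes the value under that name, the Helm values to disable the mirror configuration init container might look like this (hypothetical values.yaml snippet, nesting not verified against the chart):

```yaml
spegel:
  containerdMirrorAdd: false
```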
Hi @phillebaba, we tried disabling spegel mirror configuration.
Now we are able to pull docker images, but we often get the below error, and only after multiple retries does the image get pulled.
code = Canceled desc = failed to pull and unpack image "xxx": failed to commit snapshot extract-975686076-HBR8 sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled
Do you know why we are getting this error?
Hi @phillebaba,
We deployed a pod to pull a docker image; a new node was created and it tried to pull the image. After several tries, the image was finally pulled in 10 minutes. Below is the pod event log.
Events:
Type     Reason            Age                From               Message
----     ------            ----               ----               -------
Warning  FailedScheduling  50m                default-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {karpenter.sh/disruption: disrupting}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
Normal   Nominated         50m                karpenter          Pod should schedule on: nodeclaim/cpu-node-pool-m74hm
Normal   Scheduled         48m                default-scheduler  Successfully assigned dtp/large-pod-3 to ip-10-190-46-51.eu-west-1.compute.internal
Warning  Failed            45m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": context canceled
Warning  Failed            43m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": failed to commit snapshot extract-787971535-l7S7 sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled
Warning  Failed            39m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": failed to extract layer sha256:6779e674d34da348414644d25ffb0254c26178c88ae6854f5e1d9bc3719fab99: context canceled
Warning  Failed            39m (x3 over 45m)  kubelet            Error: ErrImagePull
Normal   BackOff           39m (x4 over 45m)  kubelet            Back-off pulling image "xxx:1.0"
Warning  Failed            39m (x4 over 45m)  kubelet            Error: ImagePullBackOff
Normal   Pulling           39m (x4 over 48m)  kubelet            Pulling image "xxx:1.0"
Normal   Pulled            28m                kubelet            Successfully pulled image "xxx:1.0" in 10m23.459933513s (10m23.459942873s including waiting)
Normal   Created           28m                kubelet            Created container large-container1
Normal   Started           28m                kubelet            Started container large-container1
Spegel log for the node:
{"level":"info","ts":1713763231.1640038,"caller":"build/main.go:204","msg":"running Spegel","registry":":5000","router":":5001"}
{"level":"info","ts":1713763231.164356,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713763231.1648052,"logger":"p2p","caller":"routing/p2p.go:123","msg":"starting p2p router","id":"/ip4/10.190.39.116/tcp/5001/p2p/12D3KooWMryfM2RbusSZQbcK64LpRR6udkjKaavpyFBGFr3nxk9x"}
{"level":"info","ts":1713763231.1651144,"caller":"leaderelection/leaderelection.go:250","msg":"attempting to acquire leader lease spegel/spegel-leader-election...\n"}
{"level":"info","ts":1713763771.1649404,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713764245.5022466,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e
{"level":"info","ts":1713764245.6634855,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e
{"level":"info","ts":1713764245.7156382,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e
{"level":"info","ts":1713764311.16439,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713764851.1644804,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713765391.164477,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713765931.1649988,"caller":"state/state.go:30","msg":"running scheduled image state update"}
We deployed another pod which gets assigned to a new node. This pod also tries to pull the same image. Below are its log,
Events:
Type     Reason     Age                From               Message
----     ------     ----               ----               -------
Normal   Scheduled  21m                default-scheduler  Successfully assigned dtp/large-pod-2 to ip-10-190-41-236.eu-west-1.compute.internal
Warning  Failed     18m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxxx:1.0": context canceled
Warning  Failed     16m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxx:1.0": failed to extract layer sha256:6779e674d34da348414644d25ffb0254c26178c88ae6854f5e1d9bc3719fab99: context canceled
Warning  Failed     13m (x3 over 18m)  kubelet            Error: ErrImagePull
Warning  Failed     13m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxxx:1.0": failed to commit snapshot extract-116295700--zVv sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled
Normal   BackOff    12m (x5 over 18m)  kubelet            Back-off pulling image "xxx:1.0"
Warning  Failed     12m (x5 over 18m)  kubelet            Error: ImagePullBackOff
Normal   Pulling    12m (x4 over 21m)  kubelet            Pulling image "xxx:1.0"
Normal   Pulled     114s               kubelet            Successfully pulled image "xxx:1.0" in 10m22.187702651s (10m22.187717239s including waiting)
Normal   Created    114s               kubelet            Created container large-container1
Normal   Started    114s               kubelet            Started container large-container1
Spegel log for 2nd node:
{"level":"info","ts":1713763055.7157514,"caller":"build/main.go:204","msg":"running Spegel","registry":":5000","router":":5001"}
{"level":"info","ts":1713763055.716569,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713763055.7168446,"logger":"p2p","caller":"routing/p2p.go:123","msg":"starting p2p router","id":"/ip4/10.190.36.182/tcp/5001/p2p/12D3KooWJJt6drajVNQfjWedNueQ8c9VszokqhjSTQcVdC9w6RAL"}
{"level":"info","ts":1713763055.716939,"caller":"leaderelection/leaderelection.go:250","msg":"attempting to acquire leader lease spegel/spegel-leader-election...\n"}
{"level":"info","ts":1713763595.716495,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713764135.7163444,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713764675.7167358,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713765215.7165866,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713765755.716464,"caller":"state/state.go:30","msg":"running scheduled image state update"}
{"level":"info","ts":1713766106.3790374,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e
{"level":"info","ts":1713766106.4227855,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e
{"level":"info","ts":1713766106.4387383,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e
{"level":"info","ts":1713766295.716398,"caller":"state/state.go:30","msg":"running scheduled image state update"}
We see no improvement in image pull time between the first node and the second node. According to our understanding of spegel, the first node will pull from artifactory and the second node will pull the image from the first node. In that case we would expect some improvement in the image pull time, but it actually took the same time.
Even when we look at the spegel logs, nothing indicates that the image was pulled from the first node. We need your help to properly understand this.
FYI, while installing spegel, we set spegel.appendMirrors=true and set spegel.containerdMirrorAdd=false. We also set spegel.registries to our private registries.
We also tried increasing spegel.mirrorResolveRetries=4 and spegel.mirrorResolveTimeout="25s", but it didn't show any improvement.
Can you please explain the expected behavior or anything that we have missed out? Thanks
Hi @phillebaba and the spegel team,
Thank you for the excellent work on the spegel project. We've been trying to use it in our application, but we're facing an issue with the TLS configuration while parsing the hosts.toml file.
Here are the details:
- We've provided the correct paths for the client.cert and client.key files in the hosts.toml file.
- We've verified that the certificate and private key files are in the correct PEM format, starting with -----BEGIN CERTIFICATE----- and -----BEGIN PRIVATE KEY-----, respectively.
- We've independently tested the tls.X509KeyPair function with the same certificate and private key files, and it works as expected.
- However, when running the spegel library, we're encountering the following error: failed to load X509 key pair: tls: found a certificate rather than a key in the PEM for the private key.
- This error seems to occur in the hosts.go file, specifically at line 224, where the tls.X509KeyPair function is called.
- Based on the Go standard library code (https://go.dev/src/crypto/tls/tls.go?s=5662:5729 - line 282), this error should only occur if the private key file contains "CERTIFICATE" instead of a valid private key.
- We've double-checked, and the private key file we're providing does not contain a certificate. It has the following format:
-----BEGIN PRIVATE KEY-----
<private_key_content>
-----END PRIVATE KEY-----
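As a standalone way to reproduce and sanity-check this class of error (this is a sketch, not Spegel's code): the snippet below generates a throwaway self-signed cert/key pair, then loads it with Go's tls.LoadX509KeyPair in both argument orders. The swapped order triggers the "PEM inputs may have been switched" error seen earlier in this thread, which is why the order in which the cert and key paths are passed matters.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"path/filepath"
	"time"
)

// writeTestPair writes a throwaway self-signed certificate and its private
// key as PEM files into a temp directory and returns their paths.
func writeTestPair() (certPath, keyPath string) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "test"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		panic(err)
	}
	dir, err := os.MkdirTemp("", "certs")
	if err != nil {
		panic(err)
	}
	certPath = filepath.Join(dir, "client.cert")
	keyPath = filepath.Join(dir, "client.key")
	os.WriteFile(certPath, pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), 0o600)
	os.WriteFile(keyPath, pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER}), 0o600)
	return certPath, keyPath
}

// checkKeyPair reports whether the two files load as a cert/key pair.
func checkKeyPair(certPath, keyPath string) error {
	_, err := tls.LoadX509KeyPair(certPath, keyPath)
	return err
}

func main() {
	cert, key := writeTestPair()
	// Certificate first, key second: loads cleanly.
	fmt.Println("cert first:", checkKeyPair(cert, key))
	// Swapped order: reproduces the "PEM inputs may have been switched" error.
	fmt.Println("key first:", checkKeyPair(key, cert))
}
```

Running the same check against the actual client.cert and client.key paths from the hosts.toml can confirm whether the files themselves are fine and the problem is only in how (or in which order) they are passed.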
We're unsure why the spegel library is encountering this error when the standard Go TLS library works correctly with the same files.
We would greatly appreciate if you could investigate this issue and provide guidance on how we can resolve it. We've been trying to resolve this for the past two weeks without success, and we're eager to use the spegel project in our application.
Thank you in advance for your assistance.
Best regards, Badal
Also, it would help if you could add the following to the debug log:
- the hosts.toml file configuration that gets passed
- the client.key and client.cert blocks getting passed
- the capability to preserve our hosts file. As shared by Indhumitha, it's not getting appended but overwritten currently, even after adding the flag.
Now I am taking a wild guess, but I think the issue is that you are passing the client certs to the wrong configuration and that the ordering might be wrong. As far as I understand mirror configuration, ordering matters, which could be why you are not seeing any performance gains even when things seem to work. The ordering should be top to bottom.
So when you are configuring the mirror configuration on your own it should look something like this. The first two entries are for Spegel. One will route to the local port and the other is a fallback to a random Spegel instance in the cluster. It isn't actually a requirement but a nice to have. Then after that comes the configuration for your artifactory registry with the client cert authentication.
server = 'https://artifactory-registry-url'
[host]
[host.'http://127.0.0.1:30020']
capabilities = ['pull', 'resolve']
[host.'http://127.0.0.1:30021']
capabilities = ['pull', 'resolve']
[host.'https://artifactory-registry-url']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']
I realize that I probably misled you in my examples. That is my fault; I did not really think through the situation. I will probably add some examples to the docs to explain what mirror configuration needs to look like.
Does this make sense or have I gotten more confused about the actual issue?
FYI I have not tested this, but I assume this is the expected configuration for Containerd. Basically we are modifying the registry configuration to use client certs only for the artifactory registry. As far as I can read the Containerd documentation, I cannot find another way to solve this.
@phillebaba, thanks for your response.
This is the hosts.toml, we are currently using.
server = 'https://artifact.xxx.com'
client = [['/etc/certs/artifact.xxx.com/client.cert','/etc/certs/artifact.xxx.com/client.pem']]
[host]
[host.'http://10.190.63.87:30020']
capabilities = ['pull', 'resolve']
[host.'http://10.190.63.87:30021']
capabilities = ['pull', 'resolve']
With this toml file, we see a 200 response on the node which already has the image. This is the log,
{"level":"info","ts":1713970486.4934852,"caller":"registry/registry.go:133","msg":"","path":"/v2/xxx/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200,"method":"GET","latency":164.578264002,"ip":"10.190.46.158" }
As per the spegel documentation, we should see logs with handler=mirror on the node which already has the docker image. This handler=mirror is not seen in our log, but the IP address is that of the new node which is trying to pull the image.
And in the logs of the new node trying to pull the image, we are not seeing any 200 response. This is the log of the new node trying to pull the image from another node.
configuration {"level":"info","ts":1713970359.27705,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://ghcr.io","path":"/etc/containerd/certs.d/ghcr.io/hosts.toml"}
configuration {"level":"info","ts":1713970359.2771664,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://quay.io","path":"/etc/containerd/certs.d/quay.io/hosts.toml"}
configuration {"level":"info","ts":1713970359.277267,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://mcr.microsoft.com","path":"/etc/containerd/certs.d/mcr.microsoft.com/hosts.toml"}
configuration {"level":"info","ts":1713970359.2773669,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://public.ecr.aws","path":"/etc/containerd/certs.d/public.ecr.aws/hosts.toml"}
registry {"level":"info","ts":1713971054.7486622,"caller":"state/state.go:30","msg":"running scheduled image state update"}
registry {"level":"info","ts":1713971097.4811194,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"CREATE"}
registry {"level":"info","ts":1713971097.5458224,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"UPDATE"}
registry {"level":"info","ts":1713971097.569677,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"CREATE"}
configuration {"level":"info","ts":1713970359.2774603,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://gcr.io","path":"/etc/containerd/certs.d/gcr.io/hosts.toml"}
configuration {"level":"info","ts":1713970359.277548,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://registry.k8s.io","path":"/etc/containerd/certs.d/registry.k8s.io/hosts.toml"}
configuration {"level":"info","ts":1713970359.2776377,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://k8s.gcr.io","path":"/etc/containerd/certs.d/k8s.gcr.io/hosts.toml"}
configuration {"level":"info","ts":1713970359.2777264,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://lscr.io","path":"/etc/containerd/certs.d/lscr.io/hosts.toml"}
configuration {"level":"info","ts":1713970359.2778192,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}
configuration {"level":"info","ts":1713970359.2778912,"caller":"build/main.go:88","msg":"gracefully shutdown"}
Stream closed EOF for spegel/spegel-z6fbx (configuration)
We tried this with versions v0.0.21 and v0.0.22. Not sure where we are going wrong, but we still feel that the new node is not pulling the image from the node that already has it; every time it is pulling from artifactory. Can you guide us on where it is going wrong and what has to be done to proceed further?
Thanks.
Sorry about the logs, I was doing some refactoring and accidentally removed the handler from the logs, so it is no longer easy to see if it's the mirror or the serve handler that is running. Will work on getting that back in.
I set up a similar solution with client certificates and got things working without any issue. I pull the first image from the original registry, then move on to another node and am able to pull the image from the other node this time.
Looking at your logs, it seems the image pull is occurring before the mirror configuration is added. Maybe that is the root of your problem: the Pod is created too quickly. Could you verify that Spegel has started properly before you try pulling the image?
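One way to verify this without eyeballing timestamps is to make the workload pod wait for the node-local registry endpoint before its main container starts. The sketch below is a hypothetical example, not something from the Spegel docs: it assumes the registry is reachable on the node at port 30020 (the NodePort that appears in the error above) and that an HTTP response on `/v2/` means Spegel is serving. Note the init container image itself still has to be pulled through the normal path.

```yaml
# Hypothetical sketch: block the app container until the node-local Spegel
# registry answers. Port 30020 and the /v2/ probe path are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: large-pod-1
spec:
  initContainers:
    - name: wait-for-spegel
      image: busybox:1.36
      env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
      command:
        - sh
        - -c
        - |
          # Poll the local registry endpoint; give up after ~60s so the
          # pod does not hang forever if Spegel never comes up.
          for i in $(seq 1 30); do
            wget -q -T 2 -O /dev/null "http://$HOST_IP:30020/v2/" && exit 0
            sleep 2
          done
          exit 1
  containers:
    - name: app
      image: xxx
```

This only delays the pod's own containers; it does not stop the kubelet from scheduling the pod onto the node in the first place.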
I see,
Our setup uses Karpenter to spin up nodes. As soon as a node comes up, pods start getting scheduled onto it, and at that point the Spegel pod may not be up yet, which could have caused this issue.
Is there a way to create a dependency so that the Spegel daemonset pod is up before our pods run?
That is a good question; I do not have the answer off the top of my head.
Could we close this issue and open another one with that question? I think it would be something valuable to document, but it is off topic for the issue at hand.
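For readers hitting the same ordering problem on Karpenter-provisioned nodes, one common pattern (an assumption on my part, not something Spegel documents) is a startup taint: Karpenter applies it to new nodes, daemonsets that must run first tolerate it, and some controller removes it once the node is ready for regular workloads. A minimal sketch against the Karpenter `v1beta1` NodePool API:

```yaml
# Hypothetical sketch: taint new nodes until node-critical daemons are up.
# The taint key is made up for illustration; Karpenter adds the taint at
# provisioning time but does NOT remove it — a separate controller (or the
# daemon itself, if it supports this) must delete the taint once ready.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      startupTaints:
        - key: example.com/spegel-not-ready   # hypothetical key
          effect: NoSchedule
```

The Spegel daemonset would then need a matching toleration so it can schedule onto the still-tainted node. Whether Spegel itself can remove the taint once running is something to confirm in the new issue; if not, a small taint-removal controller would be needed.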
Hi @phillebaba
For the previous question regarding the daemonset, we will create a new issue.
But for testing purposes, we tried assigning pods to nodes on which Spegel had already started.
Logs of node 1 [ip-10-190-58-49]:
registry {"time":"2024-05-09T02:50:57.010817724Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":132},"msg":"","path":"/XXX/blobs/sha256:b93d1501ce1249752008d3221380255b8f3c50144a1340b39abdacef1802a8d4","status":200,"method":"GET","latency":"1m51.167792276s","ip":"10.190.61.15"}
registry {"time":"2024-05-09T02:51:16.015208831Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":132},"msg":"","path":"/XXX/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200,"method":"GET","latency":"2m15.050096046s","ip":"10.190.61.15"}
registry {"time":"2024-05-09T02:57:58.533665689Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"msg":"running scheduled image state update"}
registry {"time":"2024-05-09T03:02:04.80476148Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"CREATE"}
registry {"time":"2024-05-09T03:02:05.041862364Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"UPDATE"}
registry {"time":"2024-05-09T03:02:05.044542155Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"CREATE"}
registry {"time":"2024-05-09T03:06:58.53309718Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"msg":"running scheduled image state update"}
registry {"time":"2024-05-09T03:15:58.533623363Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"msg":"running scheduled image state update"}
Here we can see that node 2 (10.190.61.15) is pulling from node 1.
Logs of node 2 [ip-10-190-61-15]:
registry {"time":"2024-05-09T02:50:40.116834275Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:f8cc231105a81f50e16fc6bf48258cc297f555f2995bd15643c74efebe5b8d78","status":200,"method":"GET","latency":"1m21.349075008s","ip":"10.190.61.15"}
registry {"time":"2024-05-09T02:50:56.910979572Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:b93d1501ce1249752008d3221380255b8f3c50144a1340b39abdacef1802a8d4","status":200,"method":"GET","latency":"1m51.062246479s","ip":"10.190.61.15"}
registry {"time":"2024-05-09T02:51:15.960404766Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200,"method":"GET","latency":"2m14.993870993s","ip":"10.190.61.15"}
registry {"time":"2024-05-09T02:53:41.042683724Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"msg":"running scheduled image state update"}
registry {"time":"2024-05-09T03:02:41.043052583Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"msg":"running scheduled image state update"}
Based on the node 2 logs, we are not sure whether it is pulling the image from node 1 or from Artifactory; we do not see any considerable decrease in image pull time.
Can you help us understand what is happening based on these logs?
Thanks.