
Removal of certs in containerd configuration causing 500 error (mirror resolve retries exhausted for key)

Open IndhumithaR opened this issue 10 months ago • 33 comments

Spegel version

v0.0.19

Kubernetes distribution

EKS

Kubernetes version

v1.27.0

CNI

VPC CNI

Describe the bug

We are using JFrog Artifactory as our registry, with TLS client certificate authentication for resolving and accessing the registry endpoints. While running a pod, image pulls failed with a 500 error (mirror resolve retries exhausted for key):

Type     Reason     Age                   From               Message
----     ------     ----                  ----               -------
Normal   Scheduled  29m                   default-scheduler  Successfully assigned large-pod-1 to ip-10-190-18-201.eu-west-1.compute.internal
Normal   Pulling    27m (x4 over 29m)     kubelet            Pulling image "xxx"
Warning  Failed     27m (x4 over 29m)     kubelet            Failed to pull image "xxx": rpc error: code = Unknown desc = failed to pull and unpack image "xxx": failed to resolve reference "xxx": unexpected status from HEAD request to http://10.190.18.201:30020/v2/xxx 500 Internal Server Error
Warning  Failed     27m (x4 over 29m)     kubelet            Error: ErrImagePull
Warning  Failed     27m (x6 over 29m)     kubelet            Error: ImagePullBackOff
Normal   BackOff    4m6s (x109 over 29m)  kubelet            Back-off pulling image "xxx"

Digging deeper, we saw that our client.cert and client.key files are not loaded into the configuration:

configuration {"level":"info","ts":1712209973.4201605,"caller":"oci/containerd.go:511","msg":"backing up Containerd host configuration","path":"/etc/containerd/certs.d/xxx"}                                                            │
│ configuration {"level":"info","ts":1712209973.420308,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"https://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml" │
│ configuration {"level":"info","ts":1712209973.4203854,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml" │
│ configuration {"level":"info","ts":1712209973.4204602,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}               │
│ configuration {"level":"info","ts":1712209973.4205267,"caller":"build/main.go:87","msg":"gracefully shutdown"}                     

Only hosts.toml seems to be loaded by Spegel, so containerd can no longer authenticate to Artifactory and pull.

first.log

Sharing the Spegel log file for reference. Can you please ensure that all the files in the cert configuration are copied into the Spegel configuration?

IndhumithaR avatar Apr 04 '24 07:04 IndhumithaR

@IndhumithaR I understand the problem you are in, as Spegel will override any existing configuration. I agree that this should be fixed.

Could you share an example hosts.toml that you are using, which can be used in a unit test?

phillebaba avatar Apr 04 '24 10:04 phillebaba

Hello @phillebaba, thanks for the quick turnaround on this.

Actually we don't have a hosts.toml file of our own. It gets created automatically with hosts on ports 30020 and 30021 using the local IPv4 of the VM.

hosts.toml

[host."https://<LOCAL_VM_IPV4>:30020"]
  capabilities = ["pull", "resolve"]

[host."https://<LOCAL_VM_IPV4>:30021"]
  capabilities = ["pull", "resolve"]

But while researching we found an interesting workaround for the issue above: creating one hosts.toml file and adding the client field within the host configuration. However, we are facing issues with it. When adding just client = ["../../client.cert", "../../client.key"]

we get an error: failed to load X509 key pair: tls: failed to find certificate PEM data in certificate input, but did find a private key; PEM inputs may have been switched

When we add it as client = [["../../client.cert", "../../client.key"],[""]] OR client = [["../../client.cert", "../../client.key"],["",""]] OR client = [["../../client.cert"], ["../../client.key"]]

we instead get xxx": dial tcp 10.190.18.201:30020: connect: no route to host, which means it still doesn't connect.

But it would be far better if the whole mirror configuration were copied over the way we add it in the certs.d configuration, instead of just a hosts.toml being created.

badaldavda8 avatar Apr 05 '24 05:04 badaldavda8

Hi @phillebaba, as Badal said, we worked around the issue by creating one hosts.toml file and adding the client field within the host configuration.

Here is the sample hosts.toml for your information:


mkdir -p /etc/containerd/certs.d/xxx/
cat > /etc/containerd/certs.d/xxx/hosts.toml << EOL
server = 'https://xxx'

[host]
[host.'http://${ipv4}:30020']
capabilities = ['pull', 'resolve']
client = ["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]

[host.'http://${ipv4}:30021']
capabilities = ['pull', 'resolve']
client = ["/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key"]
EOL

IndhumithaR avatar Apr 05 '24 05:04 IndhumithaR

Also, instead of creating a new hosts.toml file, I think it would be better to copy the existing one if it's already present. Is that currently the case?

badaldavda8 avatar Apr 05 '24 05:04 badaldavda8

Yes, I think the solution is to modify existing TOML files instead of replacing them.

The current implementation is slightly optimistic when it comes to writing mirror configuration. To avoid any issues I thought it was best to completely replace any existing configuration. I do, however, see the use case for appending to existing configuration.

A good solution would be to add a new configuration option so users can decide whether to keep the existing behavior or use the new one. We can default it to the current behavior, and that way you can opt in to appending to mirror configuration. Does that sound good?

phillebaba avatar Apr 05 '24 09:04 phillebaba

Yes, I think it should work. We can try this approach.

IndhumithaR avatar Apr 06 '24 12:04 IndhumithaR

> Yes, I think the solution is to modify existing TOML files instead of replacing them. […] Does that sound good?

Yes, that would be great. Also, not just the hosts.toml files but whatever files are in the configuration, like client.key and client.cert, should be copied and loaded. It would also be great to add a log statement for this. The same way we currently see:

configuration {"level":"info","ts":1712209973.4201605,"caller":"oci/containerd.go:511","msg":"backing up Containerd host configuration","path":"/etc/containerd/certs.d/xxx"}                                                            │
│ configuration {"level":"info","ts":1712209973.420308,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"https://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml" │
│ configuration {"level":"info","ts":1712209973.4203854,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml" │
│ configuration {"level":"info","ts":1712209973.4204602,"caller":"oci/containerd.go:564","msg":"added containerd mirror configuration","registry":"xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}               │
│ configuration {"level":"info","ts":1712209973.4205267,"caller":"build/main.go:87","msg":"gracefully shutdown"}                     

we should see added containerd mirror configuration printed with the paths of all the files, so that when Spegel starts we know which mirrors and which files were loaded and added to the configuration. If a hosts.toml file is present, overwrite it; if it is not present, create it.

badaldavda8 avatar Apr 08 '24 06:04 badaldavda8

@badaldavda8 @IndhumithaR

I have created #424 that should resolve this issue.

For clarification, the change adds a new flag --append-mirrors which, when enabled, will append to existing mirror configurations. Given the following existing mirror configuration for docker.io on the node:

server = 'https://registry-1.docker.io'

[host]

[host.'http://example.com:30020']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']

[host.'http://example.com:30021']
capabilities = ['pull', 'resolve']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']

it will result in the following configuration after Spegel has run. In this example we are adding a single mirror located at 127.0.0.1:5000.

server = 'https://registry-1.docker.io'

[host]

[host.'http://127.0.0.1:5000']
capabilities = ['pull', 'resolve']

[host.'http://example.com:30020']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

[host.'http://example.com:30021']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

Does this match the needs you have to solve the problem?

phillebaba avatar Apr 08 '24 11:04 phillebaba

> I have created #424 that should resolve this issue. […] Does this match the needs you have to solve the problem?

Hi @phillebaba, yes, this would solve our problem. It would also be better if all our files, like client.cert and client.key, remained in the config path instead of being moved to the backup path.

It would be great if you could give us an approximate ETA for fixing this issue.

IndhumithaR avatar Apr 08 '24 12:04 IndhumithaR

So the PR should be ready to merge now, and I could probably cut a release if this is a blocker for you.

I am, however, thinking about the best way of keeping the certificates in the configuration directory. The question is how far to take this, as we would need to determine which files should be kept. For this to work we would need some method of determining which directories are host configurations and which are not.

Would it be possible for you to place the certificates in a different directory, like /etc/certs/client.pem as they do in the documentation, rather than placing them in /etc/containerd/certs.d?

phillebaba avatar Apr 08 '24 19:04 phillebaba

Hello @phillebaba,

Thanks for adding the flag. But the problem we are facing is that we are not sure exactly how to add these certs in the hosts.toml file and make it work. We were just attempting a workaround for that using hosts.toml.

It would be great if the whole directory were loaded, with /etc/containerd/certs.d/xyz.com/client.cert and /etc/containerd/certs.d/xyz.com/client.key, because this is how we have had it working until now.

We did try adding it via the client field as shared above, even without Spegel, and it failed. Do you have any direction for that? The containerd documentation is not clear on how to add client.key and client.cert for TLS authentication for each mirror.

We are still trying out multiple configurations in containerd's config.toml and hosts.toml. Will keep you updated on the issue as well.

badaldavda8 avatar Apr 09 '24 07:04 badaldavda8

One solution could be to copy over all of the file contents from the existing directories, to allow registry-specific certs. I would rather avoid this if possible; if we were to add it, I would probably put the behavior behind an additional flag.

I had a look back at your previous examples and saw that you are using client cert pairs for authentication. Note that the containerd documentation has a different syntax for this compared to using a PEM file, so it is important that you configure them as key pairs:

client = [["/etc/certs/client.cert", "/etc/certs/client.key"]]

https://github.com/containerd/containerd/blob/main/docs/hosts.md#client-field

In the documentation they use absolute paths for both files, which seems to work fine even when the files are located in a different directory.
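Put together, a minimal hosts.toml using the pair syntax could look something like this (a sketch; the registry URL and cert paths are placeholders):

server = 'https://registry.example.com'

[host.'https://registry.example.com']
capabilities = ['pull', 'resolve']
client = [['/etc/certs/client.cert', '/etc/certs/client.key']]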

In theory you do not have to depend on Spegel to configure mirrors for you. As you are running EKS, it is entirely possible to set up all of the mirror configuration in your custom bootstrap.sh file and then just disable Spegel's mirror configuration.

I will go ahead and merge #424 and cut a release for that. If you need more features to persist other files in the certs.d directory, let me know. Will keep this issue open until things are resolved.

phillebaba avatar Apr 09 '24 11:04 phillebaba

@badaldavda8 @IndhumithaR I have cut v0.0.21 now, which contains an option to enable merging. As long as you are not placing certificates in the certs.d directory, I think this will work for you.
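If you are installing through the Helm chart, enabling the new behavior should look something like this (a sketch; <spegel-chart-reference> is a placeholder, and the value name matches what the reporters use later in this thread):

helm upgrade spegel <spegel-chart-reference> \
  --namespace spegel \
  --set spegel.appendMirrors=true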

phillebaba avatar Apr 10 '24 12:04 phillebaba

We will test this and get back to you!

badaldavda8 avatar Apr 10 '24 12:04 badaldavda8

Hi @phillebaba, we tried v0.0.21 with --append-mirrors set to true. Even then, we noticed that our existing hosts.toml is moved to /etc/containerd/certs.d/_backup and a new hosts.toml is created in /etc/containerd/certs.d/.

Not sure what's wrong.

IndhumithaR avatar Apr 15 '24 11:04 IndhumithaR

That is the expected behavior. It is not really possible to skip backing up the mirror configuration: we need the original registries to be able to update the registry configuration at a later date, and without the backup we won't know what the original state is.

You should see that the new host files are basically the original merged with the new configuration for Spegel, which is what we discussed.

The mirror configuration in Spegel is mostly helper logic meant to simplify operations for the majority of users. As I suggested before, if your use case does not fit this solution, you could disable Spegel's mirror configuration and rely on your own method instead.

Just to make sure: the new hosts.toml files are correct, right? With the merged content?

phillebaba avatar Apr 16 '24 08:04 phillebaba

Hi @phillebaba, we see that the hosts.toml is getting overwritten, not merged.

IndhumithaR avatar Apr 16 '24 08:04 IndhumithaR

> As I suggested before, if your use case does not fit this solution, you could disable Spegel's mirror configuration and rely on your own method instead.

How do you suggest we disable the Spegel mirror configuration? Is it by setting spegel.containerdMirrorAdd=false?

IndhumithaR avatar Apr 16 '24 09:04 IndhumithaR

Yes, if you set that to false you will disable the whole init container that writes the mirror configuration.

phillebaba avatar Apr 16 '24 09:04 phillebaba

Hi @phillebaba, we tried disabling the Spegel mirror configuration.

Now we are able to pull Docker images, but we often get the error below; after multiple retries the image is eventually pulled.

code = Canceled desc = failed to pull and unpack image "xxx": failed to commit snapshot extract-975686076-HBR8 sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled

Do you know why we are getting this error?

IndhumithaR avatar Apr 16 '24 12:04 IndhumithaR

Hi @phillebaba,

We deployed a pod to pull a Docker image; a new node was created and tried to pull the image. After several tries, the image was finally pulled in about 10 minutes. Below are the pod events:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  50m                default-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {karpenter.sh/disruption: disrupting}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
  Normal   Nominated         50m                karpenter          Pod should schedule on: nodeclaim/cpu-node-pool-m74hm
  Normal   Scheduled         48m                default-scheduler  Successfully assigned dtp/large-pod-3 to ip-10-190-46-51.eu-west-1.compute.internal
  Warning  Failed            45m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": context canceled
  Warning  Failed            43m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": failed to commit snapshot extract-787971535-l7S7 sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled
  Warning  Failed            39m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "xxx:1.0": failed to extract layer sha256:6779e674d34da348414644d25ffb0254c26178c88ae6854f5e1d9bc3719fab99: context canceled
  Warning  Failed            39m (x3 over 45m)  kubelet            Error: ErrImagePull
  Normal   BackOff           39m (x4 over 45m)  kubelet            Back-off pulling image "xxx:1.0"
  Warning  Failed            39m (x4 over 45m)  kubelet            Error: ImagePullBackOff
  Normal   Pulling           39m (x4 over 48m)  kubelet            Pulling image "xxx:1.0"
  Normal   Pulled            28m                kubelet            Successfully pulled image "xxx:1.0" in 10m23.459933513s (10m23.459942873s including waiting)
  Normal   Created           28m                kubelet            Created container large-container1
  Normal   Started           28m                kubelet            Started container large-container1

Spegel log for the node:

│ {"level":"info","ts":1713763231.1640038,"caller":"build/main.go:204","msg":"running Spegel","registry":":5000","router":":5001"}                                                                                                                  │
│ {"level":"info","ts":1713763231.164356,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713763231.1648052,"logger":"p2p","caller":"routing/p2p.go:123","msg":"starting p2p router","id":"/ip4/10.190.39.116/tcp/5001/p2p/12D3KooWMryfM2RbusSZQbcK64LpRR6udkjKaavpyFBGFr3nxk9x"}                                     │
│ {"level":"info","ts":1713763231.1651144,"caller":"leaderelection/leaderelection.go:250","msg":"attempting to acquire leader lease spegel/spegel-leader-election...\n"}                                                                            │
│ {"level":"info","ts":1713763771.1649404,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713764245.5022466,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e │
│ {"level":"info","ts":1713764245.6634855,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e │
│ {"level":"info","ts":1713764245.7156382,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e │
│ {"level":"info","ts":1713764311.16439,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                  │
│ {"level":"info","ts":1713764851.1644804,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713765391.164477,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713765931.1649988,"caller":"state/state.go:30","msg":"running scheduled image state update"}    
 

We deployed another pod, which got assigned to a new node. This pod also tried to pull the same image. Below are its events:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  21m                default-scheduler  Successfully assigned dtp/large-pod-2 to ip-10-190-41-236.eu-west-1.compute.internal
  Warning  Failed     18m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxxx:1.0": context canceled
  Warning  Failed     16m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxx:1.0": failed to extract layer sha256:6779e674d34da348414644d25ffb0254c26178c88ae6854f5e1d9bc3719fab99: context canceled
  Warning  Failed     13m (x3 over 18m)  kubelet            Error: ErrImagePull
  Warning  Failed     13m                kubelet            Failed to pull image "xxx:1.0": rpc error: code = Canceled desc = failed to pull and unpack image "artxxx:1.0": failed to commit snapshot extract-116295700--zVv sha256:d2fc1ed9e02fb5a7bd0cc587214cf26bdb04e4daa5263c7ba5847266ed32ff6a: context canceled
  Normal   BackOff    12m (x5 over 18m)  kubelet            Back-off pulling image "xxx:1.0"
  Warning  Failed     12m (x5 over 18m)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    12m (x4 over 21m)  kubelet            Pulling image "xxx:1.0"
  Normal   Pulled     114s               kubelet            Successfully pulled image "xxx:1.0" in 10m22.187702651s (10m22.187717239s including waiting)
  Normal   Created    114s               kubelet            Created container large-container1
  Normal   Started    114s               kubelet            Started container large-container1

Spegel log for 2nd node:

│ {"level":"info","ts":1713763055.7157514,"caller":"build/main.go:204","msg":"running Spegel","registry":":5000","router":":5001"}                                                                                                                  │
│ {"level":"info","ts":1713763055.716569,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713763055.7168446,"logger":"p2p","caller":"routing/p2p.go:123","msg":"starting p2p router","id":"/ip4/10.190.36.182/tcp/5001/p2p/12D3KooWJJt6drajVNQfjWedNueQ8c9VszokqhjSTQcVdC9w6RAL"}                                     │
│ {"level":"info","ts":1713763055.716939,"caller":"leaderelection/leaderelection.go:250","msg":"attempting to acquire leader lease spegel/spegel-leader-election...\n"}                                                                             │
│ {"level":"info","ts":1713763595.716495,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713764135.7163444,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713764675.7167358,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713765215.7165866,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                │
│ {"level":"info","ts":1713765755.716464,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                 │
│ {"level":"info","ts":1713766106.3790374,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e │
│ {"level":"info","ts":1713766106.4227855,"caller":"state/state.go:39","msg":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200e │
│ {"level":"info","ts":1713766106.4387383,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e │
│ {"level":"info","ts":1713766295.716398,"caller":"state/state.go:30","msg":"running scheduled image state update"}                                                                                                                                

We see no improvement in image pull time between the first node and the second node. According to our understanding of Spegel, the first node pulls from Artifactory and the second node pulls the image from the first node, so we expected some improvement in the image pull time. But it actually took the same time.

Even when we look at the Spegel logs, nothing indicates that the image was pulled from the first node. We need your help to properly understand this.

FYI, while installing Spegel we set spegel.appendMirrors=true and spegel.containerdMirrorAdd=false. We also set spegel.registries to our private registries.

We also tried increasing spegel.mirrorResolveRetries=4 and spegel.mirrorResolveTimeout="25s", but it didn't show any improvement.
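For reference, the full set of values we pass looks roughly like this (a sketch; <spegel-chart-reference> is a placeholder):

helm upgrade spegel <spegel-chart-reference> \
  --set spegel.appendMirrors=true \
  --set spegel.containerdMirrorAdd=false \
  --set "spegel.registries={https://artifact.xxx.com}" \
  --set spegel.mirrorResolveRetries=4 \
  --set spegel.mirrorResolveTimeout=25s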

Can you please explain the expected behavior or anything that we have missed? Thanks.

IndhumithaR avatar Apr 22 '24 06:04 IndhumithaR

Hi @phillebaba and the spegel team,

Thank you for the excellent work on the spegel project. We've been trying to use it in our application, but we're facing an issue with the TLS configuration while parsing the hosts.toml file.

Here are the details:

  1. We've provided the correct paths for the client.cert and client.key files in the hosts.toml file.
  2. We've verified that the certificate and private key files are in the correct PEM format, starting with -----BEGIN CERTIFICATE----- and -----BEGIN PRIVATE KEY-----, respectively.
  3. We've independently tested the tls.X509KeyPair function with the same certificate and private key files, and it works as expected.
  4. However, when running the spegel library, we're encountering the following error: failed to load X509 key pair: tls: found a certificate rather than a key in the PEM for the private key.
  5. This error seems to be occurring in the hosts.go file, specifically at line 224, where the tls.X509KeyPair function is called.
  6. Based on the Go standard library code (https://go.dev/src/crypto/tls/tls.go?s=5662:5729 - line 282), this error should only occur if the private key file contains "CERTIFICATE" instead of a valid private key.
  7. We've double-checked, and the private key file we're providing does not contain a certificate. It has the following format:
-----BEGIN PRIVATE KEY-----
<private_key_content>
-----END PRIVATE KEY-----

We're unsure why the spegel library is encountering this error when the standard Go TLS library works correctly with the same files.
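For reference, the standalone check we ran looks roughly like this (a minimal sketch; the cert and key paths are placeholders):

package main

import (
	"crypto/tls"
	"fmt"
	"os"
)

func main() {
	// tls.LoadX509KeyPair reads both files and delegates to tls.X509KeyPair,
	// the same function referenced in hosts.go above. Paths are placeholders.
	_, err := tls.LoadX509KeyPair("/etc/certs/xxx/client.cert", "/etc/certs/xxx/client.key")
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to load key pair:", err)
		os.Exit(1)
	}
	fmt.Println("key pair loaded successfully")
}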

We would greatly appreciate if you could investigate this issue and provide guidance on how we can resolve it. We've been trying to resolve this for the past two weeks without success, and we're eager to use the spegel project in our application.

Thank you in advance for your assistance.

Best regards, Badal

badaldavda8 avatar Apr 22 '24 20:04 badaldavda8

Also, it would assist us if you could add the following to the debug log:

  1. The hosts.toml configuration that gets passed
  2. The client.key and client.cert blocks that get passed
  3. The capability to preserve our hosts file. As Indhumitha shared, it is currently being overwritten rather than appended, even after adding the flag.

badaldavda8 avatar Apr 23 '24 00:04 badaldavda8

Now I am taking a wild guess, but I think the issue is that you are passing the client certs in the wrong place and that the ordering might be wrong. As far as I understand the mirror configuration, ordering matters, which could be a reason why you are not seeing any performance gains even when things seem to work. Hosts are tried top to bottom.

So when you are configuring the mirrors on your own, it should look something like this. The first two entries are for Spegel: one routes to the local port and the other is a fallback to a random Spegel instance in the cluster. The fallback isn't actually a requirement, but a nice-to-have. After that comes the configuration for your Artifactory registry with the client cert authentication.

server = 'https://artifactory-registry-url'

[host]
[host.'http://127.0.0.1:30020']
capabilities = ['pull', 'resolve']

[host.'http://127.0.0.1:30021']
capabilities = ['pull', 'resolve']

[host.'https://artifactory-registry-url']
client = ['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']
capabilities = ['pull', 'resolve']

I realize that I probably misled you with my earlier examples. That is my fault; I did not really think through the situation. I will probably add some examples to the docs to explain what mirror configuration needs to look like.

Does this make sense or have I gotten more confused about the actual issue?
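Note that, combined with the client field syntax discussed earlier, the Artifactory host entry would presumably need the pair form (an untested sketch):

[host.'https://artifactory-registry-url']
client = [['/etc/certs/xxx/client.cert', '/etc/certs/xxx/client.key']]
capabilities = ['pull', 'resolve']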

phillebaba avatar Apr 24 '24 21:04 phillebaba

FYI, I have not tested this, but I assume this is the expected configuration for containerd. Basically we are modifying the registry configuration to use client certs only for the Artifactory registry. As far as I can read the containerd documentation, I cannot find another way to solve this.

phillebaba avatar Apr 24 '24 21:04 phillebaba

@phillebaba, thanks for your response.

This is the hosts.toml we are currently using:

server = 'https://artifact.xxx.com'
client = [['/etc/certs/artifact.xxx.com/client.cert','/etc/certs/artifact.xxx.com/client.pem']]
 
[host]
[host.'http://10.190.63.87:30020']
capabilities = ['pull', 'resolve']
 
 
[host.'http://10.190.63.87:30021']
capabilities = ['pull', 'resolve']

With this TOML file, we see a 200 response on the node which already has the image. This is the log:

{"level":"info","ts":1713970486.4934852,"caller":"registry/registry.go:133","msg":"","path":"/v2/xxx/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200,"method":"GET","latency":164.578264002,"ip":"10.190.46.158"  }

As per the Spegel documentation, the node which already has the Docker image should log handler=mirror. We do not see handler=mirror in our log, though the IP address is that of the new node trying to pull the image.

And in the logs of the new node trying to pull the image, we are not seeing any 200 response. This is the log of the new node trying to pull the image from another node:

configuration {"level":"info","ts":1713970359.27705,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https:/ │
│ /ghcr.io","path":"/etc/containerd/certs.d/ghcr.io/hosts.toml"}                                                                                         │
│ configuration {"level":"info","ts":1713970359.2771664,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://quay.io","path":"/etc/containerd/certs.d/quay.io/hosts.toml"}                                                                                       │
│ configuration {"level":"info","ts":1713970359.277267,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https: │
│ //mcr.microsoft.com","path":"/etc/containerd/certs.d/mcr.microsoft.com/hosts.toml"}                                                                    │
│ configuration {"level":"info","ts":1713970359.2773669,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://public.ecr.aws","path":"/etc/containerd/certs.d/public.ecr.aws/hosts.toml"}                                                                         │
│ registry {"level":"info","ts":1713971054.7486622,"caller":"state/state.go:30","msg":"running scheduled image state update"}                            │
│ registry {"level":"info","ts":1713971097.4811194,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"CREATE"}         │
│ registry {"level":"info","ts":1713971097.5458224,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"UPDATE"}         │
│ registry {"level":"info","ts":1713971097.569677,"caller":"state/state.go:39","msg":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e","type":"CREATE"}                     │
│ configuration {"level":"info","ts":1713970359.2774603,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://gcr.io","path":"/etc/containerd/certs.d/gcr.io/hosts.toml"}                                                                                         │
│ configuration {"level":"info","ts":1713970359.277548,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https: │
│ //registry.k8s.io","path":"/etc/containerd/certs.d/registry.k8s.io/hosts.toml"}                                                                        │
│ configuration {"level":"info","ts":1713970359.2776377,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://k8s.gcr.io","path":"/etc/containerd/certs.d/k8s.gcr.io/hosts.toml"}                                                                                 │
│ configuration {"level":"info","ts":1713970359.2777264,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://lscr.io","path":"/etc/containerd/certs.d/lscr.io/hosts.toml"}                                                                                       │
│ configuration {"level":"info","ts":1713970359.2778192,"caller":"oci/containerd.go:527","msg":"added Containerd mirror configuration","registry":"https │
│ ://xxx","path":"/etc/containerd/certs.d/xxx/hosts.toml"}                                     │
│ configuration {"level":"info","ts":1713970359.2778912,"caller":"build/main.go:88","msg":"gracefully shutdown"}                                         │
│ Stream closed EOF for spegel/spegel-z6fbx (configuration)

We tried this with versions v0.0.21 and v0.0.22. We are not sure where we are going wrong, but we still believe the new node is not pulling the image from the node that already has it; every time it pulls from Artifactory. Can you guide us on where this is going wrong and what has to be done to proceed further?

Thanks.

IndhumithaR avatar Apr 25 '24 12:04 IndhumithaR

Sorry about the logs. I was doing some refactoring and accidentally removed the handler field from the logs, so it is no longer easy to see whether it is the mirror or the serve handler that is running. Will work on getting that back in.

I set up a similar solution with client certificates and got things working without any issue. I pulled the first image from the original registry, then moved to another node and was able to pull the image from the first node this time.

Looking at your logs, it appears the image pull is occurring before the mirror configuration is added. Maybe the issue is that the Pod is created too quickly? Could you verify that Spegel has started properly before you try pulling the image?
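One way to verify is to check that the Spegel pod on the node is running before the workload pod schedules, for example (a sketch, assuming Spegel is deployed as a DaemonSet in the spegel namespace):

kubectl -n spegel get pods -o wide --field-selector spec.nodeName=<node-name>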

phillebaba avatar May 07 '24 11:05 phillebaba

I see,

Our setup uses Karpenter to spin up nodes. As soon as a node comes up, pods start scheduling on it, and at that point the Spegel pod might not be up yet, which might have caused the issue.

Is there any way we can create a dependency so that the Spegel pod or DaemonSet is up before our pods run?

badaldavda8 avatar May 07 '24 11:05 badaldavda8

That is a good question; I do not have the answer off the top of my head.

Could we close this issue and open another one with this question? I think it would be valuable to document, but it is off topic to the issue at hand.

phillebaba avatar May 07 '24 20:05 phillebaba

Hi @phillebaba

For the previous question regarding the DaemonSet, we will create a new issue.

But for testing purposes, we tried assigning pods to nodes on which Spegel had already started.

Log of the first node (Node 1: ip-10-190-58-49):

│ registry {"time":"2024-05-09T02:50:57.010817724Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"/XXX/blobs/sha256:b93d1501ce1249752008d3221380255b8f3c50144a1340b39abdacef1802a8d4","status":200, │
│ "method":"GET","latency":"1m51.167792276s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:51:16.015208831Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"/XXX/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200, │
│ "method":"GET","latency":"2m15.050096046s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:57:58.533665689Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"ms │
│ g":"running scheduled image state update"}                                                                                                                                             │
│ registry {"time":"2024-05-09T03:02:04.80476148Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"msg │
│ ":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e962 │
│ 7ae07f4f7e","type":"CREATE"}                                                                                                                                                           │
│ registry {"time":"2024-05-09T03:02:05.041862364Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"ms │
│ g":"received image event","image":"xxx:1.0@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e96 │
│ 27ae07f4f7e","type":"UPDATE"}                                                                                                                                                          │
│ registry {"time":"2024-05-09T03:02:05.044542155Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":42},"ms │
│ g":"received image event","image":"xxx@sha256:fb5e6a0466be22dd75e1d1202eb0658118bc200eabacad3ba4e9627ae07f4f7e │
│ ","type":"CREATE"}                                                                                                                                                                     │
│ registry {"time":"2024-05-09T03:06:58.53309718Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"msg │
│ ":"running scheduled image state update"}                                                                                                                                              │
│ registry {"time":"2024-05-09T03:15:58.533623363Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"ms │
│ g":"running scheduled image state update"}                                                                                                                                             │

Here we are able to see that node 2 (10.190.61.15) is pulling from node 1.

Log of node 2 (ip-10-190-61-15):

│ registry {"time":"2024-05-09T02:50:40.116834275Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:f8cc231105a81f50e16fc6bf48258cc297f555f2995bd15643c74efebe5b8d78","status":200, │
│ "method":"GET","latency":"1m21.349075008s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:50:56.910979572Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:b93d1501ce1249752008d3221380255b8f3c50144a1340b39abdacef1802a8d4","status":200, │
│ "method":"GET","latency":"1m51.062246479s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:51:15.960404766Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry │
│ /registry.go","line":132},"msg":"","path":"xxx/blobs/sha256:a13193a44ca5935965796c8b5a79c340d9a4fdca268a72d5a4900e8e97587477","status":200, │
│ "method":"GET","latency":"2m14.993870993s","ip":"10.190.61.15"}                                                                                                                        │
│ registry {"time":"2024-05-09T02:53:41.042683724Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"ms │
│ g":"running scheduled image state update"}                                                                                                                                             │
│ registry {"time":"2024-05-09T03:02:41.043052583Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":33},"ms │
│ g":"running scheduled image state update"}  

Based on the log of node 2, we are not sure if it is pulling the image from node 1 or from Artifactory. We don't see any considerable decrease in image pull time.

Can you help us understand what's happening based on these logs?

Thanks.

IndhumithaR avatar May 09 '24 14:05 IndhumithaR