Calotte
It seems v0.0.23 also has this issue: several jobs always fail to pull. I observed the following error in the containerd log: _time="2024-08-20T17:27:12.187740008Z" level=error msg="cancel pulling image mcr.microsoft.com/azureml/runtime/boot/installed:0.0.1.20240813.2 because of no...
Is this possible for Docker now? It seems containerd already supports this feature.
{ "time": "2024-07-23T01:58:05.6412250Z", "level": "INFO", "source": { "function": "github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1", "file": "/build/pkg/registry/registry.go", "line": 137 }, "msg": "", "path": "/v2/system/base/job/awsome-sidecar/blobs/sha256:725aac633332a1caecb201331077e8f3891b63fe937c342cb6940aa390c2e81f", "status": 200, "method": "GET", "latency": "15m38.932650294s", "ip": "100.64.87.42", "handler": "" } It...
Thanks for your response, @phillebaba. One concern for me: in a 1k-node cluster where only 1-2 nodes have a large image, when we submit a distributed job...
I have more findings. From the log, it seems some nodes sometimes have high latency when copying blobs between nodes. Those nodes are exactly the same, so I am not sure why this...
I suspect that if a serving node is itself pulling images, the blob download may be slow, since I found that this node was also downloading other images at that time. Our scenario is...
I have a proposal for this issue. The key problem preventing us from enabling Spegel is that containerd image pulls through Spegel fail because they time out and are canceled. It's hard...