
Use k8s job for blob cache transfer

Open · munishchouhan opened this issue 1 year ago · 8 comments

This PR changes the K8s blob transfer strategy to use a Kubernetes Job.
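A minimal sketch of the kind of Job manifest such a strategy might submit, written in Python as a plain dict that follows the Kubernetes batch/v1 Job schema. The transfer image, its (source, target) arguments, the TTL value and the hash-based naming scheme are all assumptions for illustration; the "transfer-" plus 16 hex characters names only mirror the ones visible in the kubectl output later in this thread. This is not the Wave implementation.

```python
import hashlib
import json


def transfer_job_name(blob_uri: str) -> str:
    """Hypothetical naming scheme: 'transfer-' plus the first 16 hex chars of
    the SHA-256 of the blob URI, so the same blob always maps to the same Job."""
    digest = hashlib.sha256(blob_uri.encode("utf-8")).hexdigest()
    return f"transfer-{digest[:16]}"


def build_transfer_job(
    blob_uri: str,
    target_uri: str,
    namespace: str,
    image: str,
    ttl_seconds: int = 300,
) -> dict:
    """Return a batch/v1 Job manifest that copies one blob to the cache bucket.

    'image' is a hypothetical transfer image whose entrypoint takes
    (source, target) arguments; swap in whatever tool the strategy really uses.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": transfer_job_name(blob_uri),
            "namespace": namespace,
        },
        "spec": {
            # Fail fast: do not let the Job controller retry the pod;
            # the caller decides whether to resubmit.
            "backoffLimit": 0,
            # Ask Kubernetes to delete the Job (and its pods) this many
            # seconds after it finishes, so finished Jobs do not pile up.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    # Jobs only accept Never or OnFailure here.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "transfer",
                            "image": image,
                            "args": [blob_uri, target_uri],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    blob = "cr.seqera.io/v2/public/nf-jdk/blobs/sha256:4edf64cf85c039184023bdfaa7e82e8a607c7f0a55286cce0c938431af0d83d3"
    job = build_transfer_job(
        blob_uri=blob,
        target_uri="s3://wave-cache-dev/" + blob,
        namespace="wave-local",
        image="example.org/transfer:latest",  # hypothetical image
    )
    print(json.dumps(job, indent=2))
```

Piping the printed JSON into kubectl create -f - would submit it. Because the name is derived from the blob, a second create for the same blob fails with an AlreadyExists (HTTP 409) error instead of starting a duplicate transfer. A TTL like this is one way finished transfer Jobs could disappear on their own, but the thread does not say which cleanup mechanism the PR uses.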

munishchouhan, Apr 29 '24 11:04

I am testing it locally: the pods and the job are created, but the blobs are not uploaded to the AWS S3 bucket. Working on a fix.

munishchouhan, Apr 29 '24 16:04

Local testing is not working, so I will test it in dev.

munishchouhan, Apr 30 '24 16:04

Got this error while testing in dev:

2024-04-30 18:42:12.093	
16:42:12.092 [io-executor-thread-6] WARN  i.s.w.s.b.impl.BlobCacheServiceImpl - == Blob cache failed for object 'cr.seqera.io/v2/public/nf-jdk/blobs/sha256:4edf64cf85c039184023bdfaa7e82e8a607c7f0a55286cce0c938431af0d83d3' - cause: 
io.kubernetes.client.openapi.ApiException: 
	at io.kubernetes.client.openapi.ApiClient.handleResponse(ApiClient.java:989)
	at io.kubernetes.client.openapi.ApiClient.execute(ApiClient.java:905)
	at io.kubernetes.client.openapi.apis.BatchV1Api.createNamespacedJobWithHttpInfo(BatchV1Api.java:360)
	at io.kubernetes.client.openapi.apis.BatchV1Api.createNamespacedJob(BatchV1Api.java:333)
	at io.seqera.wave.service.k8s.K8sServiceImpl.transferJob(K8sServiceImpl.groovy:585)
	at io.seqera.wave.service.blob.impl.KubeTransferStrategy.transfer(KubeTransferStrategy.groovy:53)
	at io.seqera.wave.service.blob.impl.BlobCacheServiceImpl.store(BlobCacheServiceImpl.groovy:207)
	at io.seqera.wave.service.blob.impl.BlobCacheServiceImpl.storeIfAbsent(BlobCacheServiceImpl.groovy:186)
	at io.seqera.wave.service.blob.impl.BlobCacheServiceImpl.retrieveBlobCache(BlobCacheServiceImpl.groovy:102)
	at io.seqera.wave.controller.RegistryProxyController.fromDownloadResponse(RegistryProxyController.groovy:326)
	at io.seqera.wave.controller.RegistryProxyController.handleDelegate0(RegistryProxyController.groovy:231)
	at io.seqera.wave.controller.RegistryProxyController.handleGet0(RegistryProxyController.groovy:200)
	at io.seqera.wave.controller.RegistryProxyController.handleGet(RegistryProxyController.groovy:141)
	at io.seqera.wave.controller.$RegistryProxyController$Definition$Exec.dispatch(Unknown Source)
	at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:371)
	at io.micronaut.context.DefaultBeanContext$4.invoke(DefaultBeanContext.java:594)
	at io.micronaut.web.router.AbstractRouteMatch.execute(AbstractRouteMatch.java:303)
	at io.micronaut.web.router.RouteMatch.execute(RouteMatch.java:111)
	at io.micronaut.http.context.ServerRequestContext.with(ServerRequestContext.java:103)
	at io.micronaut.http.server.RouteExecutor.lambda$executeRoute$14(RouteExecutor.java:659)
	at reactor.core.publisher.FluxDeferContextual.subscribe(FluxDeferContextual.java:49)
	at reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:62)
	at reactor.core.publisher.FluxSubscribeOn$SubscribeOnSubscriber.run(FluxSubscribeOn.java:194)
	at io.micronaut.reactive.reactor.instrument.ReactorInstrumentation.lambda$init$0(ReactorInstrumentation.java:62)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
	at io.micrometer.core.instrument.composite.CompositeTimer.recordCallable(CompositeTimer.java:129)
	at io.micrometer.core.instrument.Timer.lambda$wrap$1(Timer.java:206)
	at io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedCallable.call(InvocationInstrumenterWrappedCallable.java:53)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

munishchouhan, Apr 30 '24 16:04

I think it needs permission to create Jobs in K8s.
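If the empty ApiException above comes from the API server refusing the create call (for example with a 403 Forbidden), the service account Wave runs as needs RBAC rights on Jobs in its namespace. A minimal sketch of such a Role and RoleBinding, again as Python dicts following the rbac.authorization.k8s.io/v1 schema. The role name, namespace, service account name and exact verb list are assumptions, not what the deployment actually grants.

```python
import json

RBAC_API = "rbac.authorization.k8s.io/v1"


def build_job_role(namespace: str, name: str = "wave-transfer-jobs") -> dict:
    """Role allowing management of Jobs, plus read access to their pods and logs.
    All names and the verb list are hypothetical."""
    return {
        "apiVersion": RBAC_API,
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Jobs live in the "batch" API group, not the core ("") group.
            {
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                "verbs": ["create", "get", "list", "watch", "delete"],
            },
            # Read-only access to the pods a Job spawns and their logs.
            {
                "apiGroups": [""],
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],
            },
        ],
    }


def build_role_binding(namespace: str, service_account: str,
                       role_name: str = "wave-transfer-jobs") -> dict:
    """Bind the Role above to the service account the Wave pod runs as."""
    return {
        "apiVersion": RBAC_API,
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }


def role_allows(role: dict, api_group: str, resource: str, verb: str) -> bool:
    """Simplified local check (no wildcards, no aggregation): does any rule match?"""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in role["rules"]
    )


if __name__ == "__main__":
    role = build_job_role("wave-local")
    binding = build_role_binding("wave-local", "wave-sa")  # hypothetical SA name
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [role, binding]}, indent=2))
```

On a live cluster, kubectl auth can-i create jobs --as=system:serviceaccount:<namespace>:<service-account> -n <namespace> answers the same question authoritatively, including any wildcard or aggregated rules this simplified check ignores.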

munishchouhan, Apr 30 '24 16:04

Tested locally against the working dev blob cache:

  1. Jobs and pods are created and the blobs are uploaded, but docker pull fails:
munish.chouhan@Munishs-MacBook-Pro ~ % docker pull ffb6c929b79f.ngrok.app/wt/de20f99af173/public/nf-jdk:corretto-17.0.10
corretto-17.0.10: Pulling from wt/de20f99af173/public/nf-jdk
1243323cbbce: Retrying in 4 seconds
f9a83abe90dc: Retrying in 14 seconds
1fc87daf47ad: Download complete
6d3480d9a1b8: Downloading
unknown: {"operation":"pipe","success":true,"destination":"s3://wave-cache-dev/cr.seqera.io/v2/public/nf-jdk/blobs/sha256:6d3480d9a1b8740730551c49ea4792ce68b8ee74638fd4b58f34783304e96362","object":{"type":"file"}}

munishchouhan, May 03 '24 14:05

Working now:

munish.chouhan@Munishs-MacBook-Pro blob_cache_testing % wave -i cr.seqera.io/public/nf-jdk:corretto-17.0.10 --wave-endpoint http://localhost:9090
ffb6c929b79f.ngrok.app/wt/6e2922114717/public/nf-jdk:corretto-17.0.10
munish.chouhan@Munishs-MacBook-Pro blob_cache_testing % docker pull ffb6c929b79f.ngrok.app/wt/6e2922114717/public/nf-jdk:corretto-17.0.10
corretto-17.0.10: Pulling from wt/6e2922114717/public/nf-jdk
1243323cbbce: Pull complete
f9a83abe90dc: Pull complete
1fc87daf47ad: Pull complete
6d3480d9a1b8: Pull complete
Digest: sha256:daf635dae478659aeb86296c64cfdc4dee6e1f3bc9ab49fe87034829c62d818a
Status: Downloaded newer image for ffb6c929b79f.ngrok.app/wt/6e2922114717/public/nf-jdk:corretto-17.0.10
ffb6c929b79f.ngrok.app/wt/6e2922114717/public/nf-jdk:corretto-17.0.10

munishchouhan, May 06 '24 11:05

Tested in dev:

% wave -i cr.seqera.io/public/nf-jdk:corretto-17.0.9 --wave-endpoint https://wave.dev-tower.net
wave.dev-tower.net/wt/5ed41ad6b28a/public/nf-jdk:corretto-17.0.9
munish.chouhan@Munishs-MacBook-Pro blob_cache_testing % docker pull wave.dev-tower.net/wt/5ed41ad6b28a/public/nf-jdk:corretto-17.0.9
corretto-17.0.9: Pulling from wt/5ed41ad6b28a/public/nf-jdk
6ebddf7084e9: Pull complete
fca718df3f34: Pull complete
18d54cc063eb: Pull complete
18afe249605d: Pull complete
Digest: sha256:b03a79c48047dbeaea8eed4117fc0b069458b9093cb833b93a38ad7d49b7a11a
Status: Downloaded newer image for wave.dev-tower.net/wt/5ed41ad6b28a/public/nf-jdk:corretto-17.0.9

munishchouhan, May 06 '24 23:05

@pditommaso ready for review.

munishchouhan, May 06 '24 23:05

This PR requires https://github.com/seqeralabs/platform-deployment/pull/363 to be merged first.

munishchouhan, May 24 '24 11:05

@munishchouhan can you please have a look at the merge conflicts?

pditommaso, Jul 22 '24 08:07

Working on fixing the test.

munishchouhan, Jul 22 '24 09:07

Tested again: successful.

(base) munish.chouhan@Munishs-MacBook-Pro ~ % kubectl get jobs -n wave-local
NAME                        COMPLETIONS   DURATION   AGE
transfer-a43a48be12e6f4aa   0/1           2s         2s
transfer-918964c839b722ad   0/1           2s         2s
transfer-f96c05fe22f18394   0/1           2s         2s
(base) munish.chouhan@Munishs-MacBook-Pro ~ % kubectl get jobs -n wave-local
NAME                        COMPLETIONS   DURATION   AGE
transfer-f96c05fe22f18394   0/1           49s        49s
(base) munish.chouhan@Munishs-MacBook-Pro ~ % kubectl get jobs -n wave-local
No resources found in wave-local namespace.

munishchouhan, Jul 22 '24 12:07

@munishchouhan I've deployed 1.10.0-B3 to stage; please run some more tests against the latest changes in this PR.

pditommaso, Jul 24 '24 11:07

Single-image transfer tested successfully; no errors in the logs:

(base) munish.chouhan@Munishs-MacBook-Pro wave_testing % docker pull wave.stage-seqera.io/wt/xxxxxxx/public/nf-jdk:corretto-17.0.9
corretto-17.0.9: Pulling from wt/xxxxxxx/public/nf-jdk
6ebddf7084e9: Pull complete 
fca718df3f34: Pull complete 
18d54cc063eb: Pull complete 
18afe249605d: Pull complete 
Digest: sha256:b03a79c48047dbeaea8eed4117fc0b069458b9093cb833b93a38ad7d49b7a11a
Status: Downloaded newer image for wave.stage-seqera.io/wt/xxxxxxx/public/nf-jdk:corretto-17.0.9
wave.stage-seqera.io/wt/xxxxxxx/public/nf-jdk:corretto-17.0.9

What's next:
    View a summary of image vulnerabilities and recommendations → docker scout quickview wave.stage-seqera.io/wt/xxxxxxx/public/nf-jdk:corretto-17.0.9

Next, I will load-test with multiple parallel transfer requests.

munishchouhan, Jul 24 '24 11:07

Tested with 10 parallel image pulls: all succeeded, with no errors in the logs.

munishchouhan, Jul 24 '24 12:07

I can't see any "== Blob cache" entries in the stage logs.

pditommaso, Jul 24 '24 13:07

Re: "== Blob cache"

Yes, 116 entries in the last 3 hours. [Screenshot 2024-07-24 at 15 37 11]

munishchouhan, Jul 24 '24 13:07

My fault, the CLI tool fooled me.

pditommaso, Jul 24 '24 13:07

I cannot merge it because it reports that not all commits are signed.

munishchouhan, Jul 24 '24 13:07