
Bacalhau does not detect a missing NVIDIA Container Toolkit installation

Open Zorlin opened this issue 1 year ago • 2 comments

Feb 21 20:05:14 blackberry bacalhau[32772]: {"level":"error","NodeID":"QmNRY4XQ","stack":[{"func":"(*BaseScheduler).stopJob","line":"534","source":"scheduler.go"},{"func":"(*BaseScheduler).StartJob.func1","line":"80","source":"scheduler.go"},{"func":"(*BaseScheduler).StartJob","line":"89","source":"scheduler.go"},{"func":"(*queue).StartJob","line":"49","source":"queue.go"},{"func":"(*BaseEndpoint).handleBidResponse","line":"230","source":"endpoint.go"},{"func":"(*BaseEndpoint).SubmitJob","line":"136","source":"endpoint.go"},{"func":"(*RequesterAPIServer).submit","line":"54","source":"endpoints_submit.go"},{"func":"HandlerFunc.ServeHTTP","line":"2122","source":"server.go"},{"func":"(*Handler).ServeHTTP","line":"213","source":"handler.go"},{"func":"LimitHandler.func1","line":"340","source":"tollbooth.go"},{"func":"HandlerFunc.ServeHTTP","line":"2122","source":"server.go"},{"func":"(*timeoutHandler).ServeHTTP.func1","line":"3396","source":"server.go"},{"func":"goexit","line":"1598","source":"asm_amd64.s"}],"error":"not enough nodes to run job. requested: 1, available: 0","time":"2024-02-21T20:05:14.300957883Z","caller":"pkg/requester/scheduler.go:534","message":"error completing job 5f237108-3e8a-401a-aba1-0223a190b716"}
Feb 21 20:05:14 blackberry bacalhau[32772]: {"level":"info","NodeID":"QmNRY4XQ","EventName":"Error","JobID":"5f237108-3e8a-401a-aba1-0223a190b716","SourceNodeID":"QmNRY4XQqroYueBjaGcnJXUAQngitqcJfV2Fu453zqMxEd","TargetNodeID":"","ClientID":"","Status":"not enough nodes to run job. requested: 1, available: 0","HandleDuration":0.0321,"time":"2024-02-21T20:05:14.300998203Z","caller":"pkg/eventhandler/chained_handlers.go:73","message":"Handled event"}
Feb 21 20:05:14 blackberry bacalhau[32772]: {"level":"error","NodeID":"QmNRY4XQ","error":"not enough nodes to run job. requested: 1, available: 0","time":"2024-02-21T20:05:14.301013313Z","caller":"pkg/publicapi/util.go:39"}
Feb 21 20:05:14 blackberry bacalhau[32772]: {"level":"error","NodeID":"QmNRY4XQ","Request":{"JobID":"5f237108-3e8a-401a-aba1-0223a190b716","URI":"/api/v1/requester/submit","Method":"POST","StatusCode":500,"Size":183,"Duration":0,"NodeID":"QmNRY4XQqroYueBjaGcnJXUAQngitqcJfV2Fu453zqMxEd","Ipaddr":"127.0.0.1","UserAgent":"Go-http-client/1.1"},"time":"2024-02-21T20:05:14.301058983Z","caller":"pkg/publicapi/handlerwrapper/log_handler.go:25"}
Feb 21 20:05:23 blackberry bacalhau[32772]: {"level":"debug","NodeID":"QmNRY4XQ","time":"2024-02-21T20:05:23.345595319Z","caller":"pkg/compute/sensors/logging_sensor.go:50","message":"ActiveJobs: []"}

We were having issues running Lilypad jobs with a Bacalhau sidecar, and it turned out to be because we didn't have the NVIDIA Container Toolkit (NVCT) installed. Ideally, Bacalhau would detect that you're trying to run a GPU-enabled job and that you don't have NVCT installed or enabled.

Zorlin avatar Feb 21 '24 21:02 Zorlin

There are a couple of complicating factors:

  • We support non-NVIDIA GPUs, i.e. AMD and Intel ones, which come with their own toolkits. So the NVIDIA toolkit isn't required for every GPU-enabled job.
  • In this case the system is kinda doing what it is meant to, i.e. telling you that it can't run the job because no node has a GPU. I think it's broadly correct that nodes without GPUs should be allowed to be part of a network, and that jobs shouldn't be sent to them if they aren't relevant.
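To illustrate the first point, here is a minimal sketch of a vendor-agnostic check, written in Python purely for illustration. It is not Bacalhau's implementation. The tool names probed (`nvidia-smi`, `nvidia-ctk`, `rocm-smi`, `xpu-smi`) are real vendor utilities, but treating their presence on `PATH` as "toolkit installed" is an assumption, and the names `VENDOR_TOOLS`, `detect_gpu_toolkits` and `gpu_hint` are hypothetical.

```python
import shutil

# Assumed mapping from GPU vendor to command-line tools whose presence on PATH
# suggests that vendor's tooling is installed. Illustrative only.
VENDOR_TOOLS = {
    "nvidia": ("nvidia-smi", "nvidia-ctk"),
    "amd": ("rocm-smi",),
    "intel": ("xpu-smi",),
}


def detect_gpu_toolkits(which=shutil.which):
    """Return {vendor: bool}, True if any of the vendor's tools is found.

    `which` is injectable so the check can be exercised without real hardware.
    """
    return {
        vendor: any(which(tool) is not None for tool in tools)
        for vendor, tools in VENDOR_TOOLS.items()
    }


def gpu_hint(found):
    """Return a warning only when no vendor tooling was found at all."""
    if any(found.values()):
        return None
    checked = ", ".join(sorted(found))
    return (
        f"No GPU tooling found (checked: {checked}); "
        "GPU jobs will not be scheduled to this node."
    )


if __name__ == "__main__":
    hint = gpu_hint(detect_gpu_toolkits())
    if hint:
        print(hint)
```

The hint is only produced when tooling for all three vendors is absent, which reflects the point above: a missing NVIDIA toolkit alone is not an error on a node with AMD or Intel GPUs.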

Are you running Bacalhau via the SDK? On the CLI we publish the reason that jobs aren't scheduled to nodes, which in this case would print that the number of GPUs is not available on that node. I can't see that in your logs...

We do also print log messages on node boot about the lack of toolkits available to query GPUs, but they're printed at DEBUG level because everyone found them annoying. I'm guessing they weren't obvious?
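For anyone looking for those messages when the node runs under systemd, a rough sketch of filtering the JSON log lines by level might look like the following. It assumes the line shape seen in the logs above (a journald prefix followed by one JSON object with `level`, `message` and `error` fields). The helper names are hypothetical, and the sketch does not assume any particular wording for the toolkit messages.

```python
import json


def parse_journal_line(line):
    """Return the JSON payload of a journald line as a dict, or None."""
    start = line.find("{")  # the JSON object follows the journald prefix
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def filter_records(lines, level=None, contains=None):
    """Keep records matching a log level and/or a case-insensitive substring.

    The substring is searched in the "message" and "error" fields.
    """
    records = []
    for line in lines:
        rec = parse_journal_line(line)
        if rec is None:
            continue
        if level is not None and rec.get("level") != level:
            continue
        if contains is not None:
            text = f"{rec.get('message', '')} {rec.get('error', '')}".lower()
            if contains.lower() not in text:
                continue
        records.append(rec)
    return records


if __name__ == "__main__":
    import sys

    # Example: journalctl -u bacalhau | python filter_logs.py
    # (the unit name and script name here are assumptions)
    for rec in filter_records(sys.stdin, level="debug"):
        print(rec.get("time"), rec.get("caller"), rec.get("message"))
```

Passing `contains="gpu"` would narrow the output further, assuming the relevant messages mention GPUs at all.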

simonwo avatar Feb 29 '24 07:02 simonwo

> Are you running Bacalhau via the SDK? On the CLI we publish the reason that jobs aren't scheduled to nodes, which in this case would print that the number of GPUs is not available on that node. I can't see that in your logs...

No, running Bacalhau as a systemd service and then interacting with it via the Lilypad CLI.

All of the above makes sense to me though and sounds reasonable.

Zorlin avatar Mar 13 '24 18:03 Zorlin