Additional checks on GPU when running containerised sorting

Open JoeZiminski opened this issue 1 year ago • 3 comments

I often have a hard time debugging GPU issues when running containerised sorting. There are already a lot of useful checks in SI, and performing checks is not always straightforward (e.g. #1398), as you can't always rely on commands like nvidia-smi being accessible. However, I think there is room for a couple more useful checks, which I've divided into 'easier checks' and 'harder checks'.

Easier Checks

It would be good to check that docker or singularity is installed on the system at all (e.g. that docker --version returns a zero exit code). At the moment, if it is not installed, the error is unclear.
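A minimal sketch of such a check, assuming the executables are simply named docker and singularity and are expected on PATH. The function names container_cli_available and check_container_cli are hypothetical, not existing SI functions:

```python
import shutil
import subprocess


def container_cli_available(executable: str) -> bool:
    """Return True if `<executable> --version` runs and exits with code 0."""
    # shutil.which avoids a FileNotFoundError when the binary is not on PATH.
    if shutil.which(executable) is None:
        return False
    try:
        # Pass the command as a list: a single string such as
        # "docker --version" needs shell=True on Linux and macOS.
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def check_container_cli(mode: str) -> None:
    """Raise a readable error if the CLI for `mode` is not usable.

    `mode` is assumed to be either "docker" or "singularity".
    """
    if mode not in ("docker", "singularity"):
        raise ValueError(f"Unknown container mode: {mode!r}")
    if not container_cli_available(mode):
        raise RuntimeError(
            f"'{mode} --version' failed. Is {mode} installed and on your PATH?"
        )
```

Calling check_container_cli("docker") before starting a containerised sorter would then fail early with a readable message instead of an unclear downstream error.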

If you are running singularity, spython is a dependency; if you are running docker, the docker Python package (from PyPI) is. However, these can't easily be installed by default as they don't all work cross-platform. You could do something like the below in the pyproject.toml:

"docker; platform_system=='Windows'",
"docker; platform_system=='Darwin'",
"spython; platform_system=='Linux'",  # I think missing from SI?
"cuda-python; platform_system != 'Darwin'",

Alternatively, it would be nice to raise an error when trying to run with docker if the docker Python package is not installed, or when running with singularity if spython is not installed.
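One way this could look, as a rough sketch: the import names docker and spython are the ones the PyPI packages provide, while module_installed and check_container_python_module are hypothetical helper names, not part of SI:

```python
import importlib.util

# Import name of the Python client needed for each container mode.
REQUIRED_MODULE = {"docker": "docker", "singularity": "spython"}


def module_installed(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def check_container_python_module(mode: str) -> None:
    """Raise a readable error if the Python client for `mode` is missing."""
    if mode not in REQUIRED_MODULE:
        raise ValueError(f"Unknown container mode: {mode!r}")
    module_name = REQUIRED_MODULE[mode]
    if not module_installed(module_name):
        raise ImportError(
            f"Running with {mode} requires the Python package "
            f"'{module_name}'. Try: pip install {module_name}"
        )
```

Using find_spec rather than a bare import keeps the check cheap, since the module is located but not actually imported.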

Harder Checks

Today we had an issue that was quite hard to debug, where nvidia-docker was required but not installed. It would be great to check that nvidia-docker is installed when running on docker (some details here). However, it sounds like this is a bit of a nightmare: AFAIK it's not cross-platform, and nvidia-docker is now superseded by nvidia-container-toolkit. Nonetheless, it might be worth checking, when trying to use docker on Linux, that either nvidia-docker, nvidia-docker2 or nvidia-container-toolkit is installed (some more detailed explanation of their differences here). The best would be to check this through docker directly, but I don't think it's possible.

JoeZiminski avatar May 15 '24 16:05 JoeZiminski

I think there is some low-hanging fruit in what you pointed out. That's great! We should make a TODO list : )

h-mayorquin avatar May 15 '24 17:05 h-mayorquin

Great! How about:

  1. Add checks that if trying to run in docker, docker is installed, and if running singularity, singularity is installed (something like checking that the returncode of subprocess.run("docker --version") is 0).
  2. Add checks that if trying to run in docker, the Python module docker is installed, and if running singularity, spython is installed.
  3. (tentative) Check that if running docker on Linux, nvidia-docker, nvidia-docker2 or nvidia-container-toolkit is installed. I will do a bit more reading, but I think this covers the key dependencies. For testing, we can use CI, a few of us can try it, and I can ask dandi-hub to test. If all these pass it should be okay? I'm hoping it will be less tricky than nvidia-smi, as it will just be an 'is this installed' test, similar to docker and singularity above.

JoeZiminski avatar May 16 '24 11:05 JoeZiminski

Great! Thanks for the actionable plan.

h-mayorquin avatar May 16 '24 13:05 h-mayorquin

@JoeZiminski this could be another "quick" project for the hackathon! Would you like to add it to the list of projects?

alejoe91 avatar May 23 '24 08:05 alejoe91

Thanks both, and sorry @alejoe91, I missed this prior to the hackathon!

I will open a PR for (1) and (2) now. For (3), because I am not 100% sure exactly how it will work and am worried about raising an error, how about raising a warning instead? E.g. if these dependencies are not found on a Linux machine when using docker, something like: 'nvidia-docker, nvidia-docker2 or nvidia-container-toolkit not found, this may cause an error when running GPU-dependent dockers. Install XXX (not exactly sure which one, I think the toolkit is most recent) to fix this error.'
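A hedged sketch of what the warning for (3) might look like. The executable names below are assumptions about what these packages put on PATH and would need verifying on real systems; the function name and its system and which parameters are hypothetical, added only so the logic can be tested without touching the machine:

```python
import platform
import shutil
import warnings

# Assumption: executables typically installed by nvidia-docker,
# nvidia-docker2 or nvidia-container-toolkit. Not verified on all distros.
CANDIDATE_EXECUTABLES = (
    "nvidia-docker",
    "nvidia-container-toolkit",
    "nvidia-container-runtime",
    "nvidia-container-cli",
)


def warn_if_nvidia_container_support_missing(system=None, which=shutil.which):
    """Warn (never raise) if no NVIDIA container executable is on PATH.

    Returns True if a warning was emitted, False otherwise.
    """
    system = platform.system() if system is None else system
    if system != "Linux":
        # The check only makes sense on Linux, so skip it elsewhere.
        return False
    if any(which(name) is not None for name in CANDIDATE_EXECUTABLES):
        return False
    warnings.warn(
        "nvidia-docker, nvidia-docker2 or nvidia-container-toolkit not "
        "found. This may cause an error when running GPU-dependent "
        "docker images.",
        stacklevel=2,
    )
    return True
```

Because this only looks for executables on PATH, a missing match is a hint rather than proof, which is why it warns instead of raising.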

JoeZiminski avatar Jun 12 '24 17:06 JoeZiminski