Additional checks on GPU when running containerised sorting
I often have a hard time debugging GPU issues when running containerised sorting. There are already a lot of useful checks in SI, and performing checks is not always straightforward (e.g. #1398), as we can't always rely on commands like nvidia-smi being accessible. However, I think there is room for a couple more useful checks, which I've divided into 'easier checks' and 'harder checks'.
Easier Checks
It would be good to check that docker or singularity is installed on the system at all (e.g. that docker --version returns exit code 0). At the moment, if it is not installed, the error is unclear.
If you are running singularity, spython is a dependency, and if docker, then docker (the Python package from PyPI). However, these can't easily be installed by default as they don't all work cross-platform. You could do something like the below in the pyproject.toml:
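A minimal sketch of this check, using only the standard library. The function names and the error message are hypothetical and not part of SI; it assumes the executable is called exactly docker or singularity and is on PATH.

```python
import shutil
import subprocess


def container_engine_available(executable: str) -> bool:
    """Return True if `<executable> --version` runs and exits with code 0."""
    # shutil.which returns None when the executable is not found on PATH.
    if shutil.which(executable) is None:
        return False
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The executable exists but could not be run, or it hung.
        return False
    return result.returncode == 0


def check_container_engine(mode: str) -> None:
    """Raise a clear error if the engine for `mode` is not usable.

    `mode` is assumed to be either "docker" or "singularity".
    """
    if mode not in ("docker", "singularity"):
        raise ValueError(f"Unknown container mode: {mode!r}")
    if not container_engine_available(mode):
        raise RuntimeError(
            f"{mode} does not appear to be installed (or is not on PATH). "
            f"Install {mode} before running containerised sorting."
        )
```

Calling check_container_engine("docker") before starting the container would turn the current unclear failure into an explicit message.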
"docker; platform_system=='Windows'",
"docker; platform_system=='Darwin'",
"spython; platform_system=='Linux'", # I think missing from SI?
"cuda-python; platform_system != 'Darwin'",
Alternatively, it would be nice to raise an error if trying to run with docker and docker is not installed or running with singularity and spython is not installed.
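As a rough sketch of the 'raise an error' option: the check below looks for the Python client module without importing it. The names REQUIRED_MODULE, python_module_installed and check_python_client are made up for illustration, and the pip command in the message assumes the PyPI package name matches the module name.

```python
import importlib.util

# Container mode -> Python client module it needs, as described above.
REQUIRED_MODULE = {"docker": "docker", "singularity": "spython"}


def python_module_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def check_python_client(mode: str) -> None:
    """Raise an informative error if the client module for `mode` is missing."""
    module_name = REQUIRED_MODULE[mode]
    if not python_module_installed(module_name):
        raise ModuleNotFoundError(
            f"Running with {mode} requires the Python package "
            f"'{module_name}'. Try: pip install {module_name}"
        )
```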
Harder Checks
Today we had an issue that was quite hard to debug, where nvidia-docker was required but not installed. It would be great to check, if running on docker, that nvidia-docker is installed (some details here). However, it sounds like this is a bit of a nightmare: AFAIK it's not cross-platform, and nvidia-docker is now superseded by nvidia-container-toolkit. Nonetheless, it might be worth checking that, if trying to use docker on Linux, either nvidia-docker, nvidia-docker2 or nvidia-container-toolkit is installed (a more detailed explanation of their differences here). The best would be to check this through docker directly, but I don't think that's possible.
I think there is some low-hanging fruit that you pointed out. That's great! We should make a TODO list : )
Great! How about:
- Add checks that if trying to run in docker, docker is installed, and if running singularity, singularity is installed (something like subprocess.run(["docker", "--version"]).returncode == 0).
- Add checks that if trying to run in docker, the python module docker is installed. If running singularity, spython is installed.
- (tentative) Check that if running docker and on Linux, nvidia-docker or nvidia-docker2 or nvidia-container-toolkit is installed.
Will do a bit more reading, but I think this covers the key dependencies. Testing across CI, a few of us, and I can ask dandi-hub to test. If all these pass it should be okay? I'm hoping it will be less tricky than nvidia-smi, as it will just be an 'is this installed' test, similar to docker and singularity above.
Great! Thanks for the actionable plan.
@JoeZiminski this could be another "quick" project for the hackathon! Would you like to add it to the list of projects?
Thanks both, sorry @alejoe91 missed this prior to the hackathon!
I will open a PR for (1) and (2) now. For (3), because I am not 100% sure exactly how it will work and am worried about raising an error, how about raising a warning instead? E.g. if these dependencies are not found on a Linux machine when using docker, something like 'nvidia-docker, nvidia-docker2 or nvidia-container-toolkit not found, this may cause an error when running GPU-dependent dockers. Install XXX (not exactly sure which one, I think the toolkit is most recent) to fix this error.'
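A tentative sketch of the warning-only version of (3). The executable names in NVIDIA_EXECUTABLES are an assumption that would need verifying: which binaries actually end up on PATH depends on the package and its version, and nvidia-container-toolkit here is the successor to nvidia-docker mentioned earlier in the thread. The function names are hypothetical as well.

```python
import platform
import shutil
import warnings

# Assumed executable names (to be verified); not an authoritative list.
NVIDIA_EXECUTABLES = (
    "nvidia-docker",
    "nvidia-container-toolkit",
    "nvidia-container-runtime",
)


def find_nvidia_container_executables(candidates=NVIDIA_EXECUTABLES):
    """Return the candidate executables that are found on PATH."""
    return [name for name in candidates if shutil.which(name) is not None]


def warn_if_nvidia_container_support_missing(mode: str) -> bool:
    """Warn, rather than raise, when docker GPU support looks missing.

    Only applies to docker on Linux. Returns True if a warning was emitted.
    """
    if mode != "docker" or platform.system() != "Linux":
        return False
    if find_nvidia_container_executables():
        return False
    warnings.warn(
        "nvidia-docker, nvidia-docker2 or nvidia-container-toolkit not found; "
        "this may cause an error when running GPU-dependent docker containers.",
        stacklevel=2,
    )
    return True
```

Because this only warns, a false negative (e.g. the toolkit being installed under a name not in the list) would not stop sorting from running.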