
train.py is unable to find nvidia-smi.exe on Windows 10

Open • MarcusLlewellyn opened this issue 3 years ago • 1 comment

When attempting to train using the example provided in the root README.md file, training fails on Windows 10 with the following error:

Traceback (most recent call last):
  File "e:/gpt/aitextgen/train.py", line 30, in <module>
    ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\aitextgen\aitextgen.py", line 707, in train
    trainer.fit(train_model)
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 513, in fit
    self.dispatch()
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 553, in dispatch
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 111, in start_training
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 644, in run_train
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 506, in run_training_epoch
    self.on_train_batch_end(epoch_output, batch_end_outputs, batch, batch_idx, dataloader_idx)
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 208, in on_train_batch_end
    self.trainer.call_hook('on_batch_end')
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1101, in call_hook
    trainer_hook(*args, **kwargs)
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 145, in on_batch_end
    callback.on_batch_end(self, self.lightning_module)
  File "E:\Anaconda3\envs\aitextgen\lib\site-packages\aitextgen\train.py", line 160, in on_batch_end
    result = subprocess.run(
  File "E:\Anaconda3\envs\aitextgen\lib\subprocess.py", line 489, in run
    with Popen(*popenargs, **kwargs) as process:
  File "E:\Anaconda3\envs\aitextgen\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "E:\Anaconda3\envs\aitextgen\lib\subprocess.py", line 1247, in _execute_child
    args = list2cmdline(args)
  File "E:\Anaconda3\envs\aitextgen\lib\subprocess.py", line 549, in list2cmdline
    for arg in map(os.fsdecode, seq):
  File "E:\Anaconda3\envs\aitextgen\lib\os.py", line 818, in fsdecode
    filename = fspath(filename)  # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not NoneType

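Digging a bit deeper, this appears to be because train.py looks up nvidia-smi with shutil.which("nvidia-smi"), which returns None when the executable cannot be found, and that None is then passed on to subprocess.run. Unfortunately, nvidia-smi.exe is not on the executable path by default on Windows.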
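For illustration, here is a minimal sketch of how the lookup could be guarded so that a missing executable does not reach subprocess.run. This is not the aitextgen code: the function name `query_gpu_memory` and its return convention are hypothetical, and the query flags are standard nvidia-smi options chosen for the example.

```python
import shutil
import subprocess
from typing import Optional


def query_gpu_memory(executable: str = "nvidia-smi") -> Optional[str]:
    """Return raw GPU memory output, or None if the tool is unavailable.

    Hypothetical helper for illustration only, not part of aitextgen.
    """
    path = shutil.which(executable)
    if path is None:
        # shutil.which returns None when nothing is found on PATH.
        # Passing that None to subprocess.run is what triggers the TypeError.
        return None

    result = subprocess.run(
        [path, "--query-gpu=memory.used,memory.total",
         "--format=csv,nounits,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        encoding="utf-8",
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

With a guard like this, a caller could skip the GPU memory display when the function returns None instead of crashing mid-training.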

I worked around this by simply copying nvidia-smi.exe into my working directory. For informational purposes, the default install location for nvidia-smi.exe seems to be C:\Program Files\NVIDIA Corporation\NVSMI. Adding this directory to one's PATH is another workaround.
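The PATH workaround could also be approximated in code by falling back to known install directories when the normal lookup fails. The sketch below is only an assumption about how that might look: `find_nvidia_smi` and `DEFAULT_WINDOWS_DIRS` are hypothetical names, and the directory is the one reported above, which may differ between driver versions.

```python
import os
import shutil
from typing import Iterable, Optional

# Install location reported in this issue; treat it as an assumption,
# since it may differ between driver versions.
DEFAULT_WINDOWS_DIRS = [r"C:\Program Files\NVIDIA Corporation\NVSMI"]


def find_nvidia_smi(
    name: str = "nvidia-smi",
    extra_dirs: Iterable[str] = DEFAULT_WINDOWS_DIRS,
) -> Optional[str]:
    """Look for the executable on PATH first, then in extra_dirs.

    Hypothetical helper for illustration only, not part of aitextgen.
    """
    found = shutil.which(name)
    if found is not None:
        return found

    # Build a search path from the fallback directories only.
    fallback_path = os.pathsep.join(extra_dirs)
    if not fallback_path:
        return None
    return shutil.which(name, path=fallback_path)
```

The function still returns None when nothing is found, so the caller needs to check the result before handing it to subprocess.run.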

I poked around on StackExchange hoping to find an alternative way to get GPU memory status on Windows, but came up empty.

MarcusLlewellyn, Mar 07 '21 19:03

This is likely an issue that should also be filed with pytorch-lightning, as my approach mostly duplicates the one used there.

minimaxir, Mar 09 '21 04:03