aitextgen
train.py is unable to find nvidia-smi.exe on Windows 10
When attempting to train using the example provided in the root README.md
file, training fails on Windows 10 with the following error:
Traceback (most recent call last):
File "e:/gpt/aitextgen/train.py", line 30, in <module>
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\aitextgen\aitextgen.py", line 707, in train
trainer.fit(train_model)
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 513, in fit
self.dispatch()
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 553, in dispatch
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 111, in start_training
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 644, in run_train
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 506, in run_training_epoch
self.on_train_batch_end(epoch_output, batch_end_outputs, batch, batch_idx, dataloader_idx)
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 208, in on_train_batch_end
self.trainer.call_hook('on_batch_end')
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1101, in call_hook
trainer_hook(*args, **kwargs)
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 145, in on_batch_end
callback.on_batch_end(self, self.lightning_module)
File "E:\Anaconda3\envs\aitextgen\lib\site-packages\aitextgen\train.py", line 160, in on_batch_end
result = subprocess.run(
File "E:\Anaconda3\envs\aitextgen\lib\subprocess.py", line 489, in run
with Popen(*popenargs, **kwargs) as process:
File "E:\Anaconda3\envs\aitextgen\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "E:\Anaconda3\envs\aitextgen\lib\subprocess.py", line 1247, in _execute_child
args = list2cmdline(args)
File "E:\Anaconda3\envs\aitextgen\lib\subprocess.py", line 549, in list2cmdline
for arg in map(os.fsdecode, seq):
File "E:\Anaconda3\envs\aitextgen\lib\os.py", line 818, in fsdecode
filename = fspath(filename) # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not NoneType
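Digging a bit deeper, this appears to happen because train.py looks for nvidia-smi with shutil.which("nvidia-smi"), which returns None when the executable cannot be found, and that None is then passed to subprocess.run. Unfortunately, nvidia-smi.exe is not on the executable path by default on Windows.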
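For illustration, here is a minimal sketch of how the lookup could be guarded so that a missing executable does not raise the TypeError shown in the traceback. The function name query_gpu_memory and its executable parameter are hypothetical and are not part of aitextgen; the sketch only assumes that nvidia-smi accepts its usual --query-gpu and --format options.

```python
import shutil
import subprocess
from typing import Optional


def query_gpu_memory(executable: str = "nvidia-smi") -> Optional[str]:
    """Return nvidia-smi's raw used-memory output, or None if unavailable.

    Hypothetical helper for illustration; not aitextgen's actual code.
    """
    path = shutil.which(executable)
    if path is None:
        # shutil.which returns None when nothing is found on PATH.
        # Passing that None to subprocess.run is what raises the TypeError.
        return None
    try:
        result = subprocess.run(
            [path, "--query-gpu=memory.used", "--format=csv,nounits,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Treat a failed call the same as a missing executable.
        return None
    return result.stdout.strip()
```

With a guard like this, a training callback could skip the GPU memory readout when the function returns None instead of crashing.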
I worked around this by simply copying nvidia-smi.exe into my working directory. For reference, the default install location for nvidia-smi.exe seems to be C:\Program Files\NVIDIA Corporation\NVSMI. Adding this directory to one's PATH is another workaround.
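The PATH workaround could also be approximated in code. The sketch below first searches the normal PATH and then falls back to the install directory mentioned above. The names locate_executable and DEFAULT_NVSMI_DIR are hypothetical, and the fallback directory is only the location observed in this report, so it may differ on other driver versions.

```python
import shutil
from typing import Optional

# Location observed in this report; an assumption, not a guaranteed path.
DEFAULT_NVSMI_DIR = r"C:\Program Files\NVIDIA Corporation\NVSMI"


def locate_executable(
    name: str = "nvidia-smi", fallback_dir: str = DEFAULT_NVSMI_DIR
) -> Optional[str]:
    """Look for `name` on PATH, then in `fallback_dir`; None if not found."""
    found = shutil.which(name)
    if found is not None:
        return found
    # The `path` argument makes shutil.which search only the given directory.
    return shutil.which(name, path=fallback_dir)
```

A caller would still need to handle a None result, since the executable may be absent from both locations.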
I poked around on StackExchange hoping to find an alternative way of getting GPU memory status on Windows, but came up empty.
This may also be an issue that should be filed with pytorch-lightning, as my approach for that mostly duplicates the one used there.