Data Iteration Issues
When iterating through the Navigate, NavigateExtreme, and NavigateDense data using data.batch_iter(), I get one of the following errors. They do not occur at the beginning of the batch iteration loop, but only after more than 100 iterations of training:
Traceback (most recent call last):
File "train-extreme-loaded.py", line 35, in <module>
for s, a, r, sp, d in data.batch_iter(
File "/home/jack/.local/lib/python3.8/site-packages/minerl/data/data_pipeline.py", line 405, in batch_iter
for seg_batch in minibatch_gen(traj_iter(), batch_size=batch_size, nsteps=seq_len):
File "/home/jack/.local/lib/python3.8/site-packages/minerl/data/util/__init__.py", line 269, in minibatch_gen
trajs[i] = t = multimap(cat, *[t, next(traj_iter)])
File "/home/jack/.local/lib/python3.8/site-packages/minerl/data/data_pipeline.py", line 385, in traj_iter
s, a, r, sp1, d = trajectory_queue.get()
TypeError: cannot unpack non-iterable NoneType object
it=171 Loss: 10.137125015258789
it=172 Loss: 8.685365676879883
Exception in thread QueueManagerThread:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.8/concurrent/futures/process.py", line 441, in _queue_management_worker
shutdown_worker()
File "/usr/lib/python3.8/concurrent/futures/process.py", line 334, in shutdown_worker
call_queue.put_nowait(None)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 132, in put_nowait
return self.put(obj, False)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 82, in put
raise ValueError(f"Queue {self!r} is closed")
ValueError: Queue <concurrent.futures.process._SafeQueue object at 0x7f2d5967ce80> is closed
it=159 Loss: 8.279386520385742
it=160 Loss: 11.42676830291748
Exception in thread QueueManagerThread:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.8/concurrent/futures/process.py", line 376, in _queue_management_worker
thread_wakeup.clear()
File "/usr/lib/python3.8/concurrent/futures/process.py", line 94, in clear
self._reader.recv_bytes()
File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
OSError: [Errno 9] Bad file descriptor
I'm running Arch Linux with PyTorch on an Nvidia GTX 1050M, using Python 3.8.6. My main training script is below (with the utilities and model implementation omitted):
import sys
import gym
import minerl
import torch
from torch import nn
from model import Model
from utils import *

torch.cuda.set_device(0)
if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"

LR = 0.0001
SEQ_LEN = 16
BATCH_SIZE = 64

# Sample some data from the dataset!
PATH = "model_state_dict"
data = minerl.data.make("MineRLNavigateDense-v0")
model = Model(2, 200).cuda()
model.load_state_dict(torch.load(PATH))

cross_ent = nn.CrossEntropyLoss().cuda()
mse = nn.MSELoss().cuda()
# optimizer = torch.optim.SGD(model.parameters(), lr=LR)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
old_loss = float("Inf")

# Iterate through a single epoch using sequences of at most SEQ_LEN steps
it = 0
for s, a, r, sp, d in data.batch_iter(num_epochs=1, seq_len=SEQ_LEN, batch_size=BATCH_SIZE):
    # optimizer changes params, loss computes the gradients
    # dicts are arrays of samples
    pov_tensor, feat_tensor = Navigatev0_obs_to_tensor(s)
    pov_tensor = pov_tensor.cuda()
    feat_tensor = feat_tensor.cuda()
    pov_tensor, feat_tensor = (
        torch.transpose(expand(pov_tensor), 1, 3),
        expand(feat_tensor),
    )
    action_tensor = Navigatev0_action_to_tensor(a)
    action_tensor = {a: expand(t).cuda() for a, t in action_tensor.items()}

    # Training step
    optimizer.zero_grad()
    outputs = model(pov_tensor, feat_tensor).cuda()
    loss = cross_ent(outputs[:, 0:2], action_tensor["attack"])
    loss += cross_ent(outputs[:, 2:4], action_tensor["back"])
    loss += mse(outputs[:, 4:6], action_tensor["camera"])
    loss += cross_ent(outputs[:, 6:8], action_tensor["forward"])
    loss += cross_ent(outputs[:, 8:10], action_tensor["jump"])
    loss += cross_ent(outputs[:, 10:12], action_tensor["left"])
    loss += cross_ent(outputs[:, 12:14], action_tensor["right"])
    loss += cross_ent(outputs[:, 14:16], action_tensor["place"])
    loss += cross_ent(outputs[:, 16:18], action_tensor["sneak"])
    loss += cross_ent(outputs[:, 18:20], action_tensor["sprint"])
    loss.backward()
    optimizer.step()

    it += 1
    if it % 1 == 0:
        print(f"{it=} Loss: {loss.item()}")
    if it >= 1000:
        # if loss.item() > old_loss + 0.5 or it >= 30:
        print(f"Converged at iter {it} with loss {loss.item()}")
        break
    old_loss = loss.item()
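(For context, since the utils module is omitted: below is a rough, hypothetical sketch of what a helper like Navigatev0_obs_to_tensor could look like, assuming the Navigate observations expose "pov" frames and a scalar "compassAngle"; the key names and shapes here are assumptions, and the real helper may differ.)

import numpy as np
import torch

def Navigatev0_obs_to_tensor_sketch(obs):
    # Hypothetical stand-in for the omitted Navigatev0_obs_to_tensor helper.
    # Assumes obs["pov"] is a uint8 array of frames and obs["compassAngle"]
    # is a scalar angle per step (both key names are assumptions).
    pov = torch.from_numpy(obs["pov"].astype(np.float32) / 255.0)
    compass = torch.from_numpy(np.asarray(obs["compassAngle"], dtype=np.float32))
    return pov, compass.unsqueeze(-1)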
I have tried re-downloading parts of the dataset in case it was corrupted in some way, but no dice. Any help is greatly appreciated.
Update: After talking with Miffyli on the #support channel, two of the errors turn out not to be MineRL issues: the OSError: [Errno 9] Bad file descriptor is apparently intended behavior, and ValueError: Queue <concurrent.futures.process._SafeQueue object at 0x7f2d5967ce80> is closed is apparently a multiprocessing issue with PyTorch. However, I still don't understand the NoneType error:
File "/home/jack/.local/lib/python3.8/site-packages/minerl/data/data_pipeline.py", line 385, in traj_iter
s, a, r, sp1, d = trajectory_queue.get()
TypeError: cannot unpack non-iterable NoneType object
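From the traceback, the internal trajectory queue appears to hand back None (presumably an end-of-stream or failed-load marker), which batch_iter then tries to unpack. As a stop-gap while debugging, not a fix for the underlying loader problem, one could wrap batch_iter so that this TypeError ends the iteration instead of crashing. A minimal sketch:

def safe_batch_iter(data, **kwargs):
    # Defensive wrapper: stop iterating when batch_iter's internal queue
    # yields None instead of a (s, a, r, sp, d) tuple.
    it = data.batch_iter(**kwargs)
    while True:
        try:
            yield next(it)
        except TypeError:
            # "cannot unpack non-iterable NoneType object" from the pipeline
            return
        except StopIteration:
            return

It would be used the same way as batch_iter, e.g. for s, a, r, sp, d in safe_batch_iter(data, num_epochs=1, seq_len=SEQ_LEN, batch_size=BATCH_SIZE).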
Code to reproduce the aforementioned error:
import gym
import minerl

it = 0
data = minerl.data.make("MineRLNavigateExtreme-v0")
for s, a, r, sp, d in data.batch_iter(num_epochs=10, seq_len=16, batch_size=64):
    print(it)
    it += 1
I can confirm the above error happens with the code above and with the "MineRLNavigateExtremeDense-v0" data (I do not have MineRLNavigateExtreme-v0, even though we have the same VERSION=3). It seems to happen at some specific data file after a few iterations of batch_iter.
Try data.load_data instead of data.batch_iter
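For what it's worth, a minimal sketch of iterating per trajectory with load_data, assuming the 0.3.x API where get_trajectory_names() lists the streams and load_data(name) yields per-step (obs, action, reward, next_obs, done) tuples; batching and sequence chunking would then have to be done manually:

import minerl

data = minerl.data.make("MineRLNavigateExtremeDense-v0")
for name in data.get_trajectory_names():
    # Each call loads one full trajectory and yields it step by step.
    for s, a, r, sp, d in data.load_data(name):
        pass  # feed each step into your own batching logic here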