skorch
get_len has issues with nested lists
For example:
inferno.dataset.get_len([[(1,2),(2,3)],[(4,5)], [(7,8,9)]])
expected: 3
actual: ValueError: Dataset does not have consistent lengths.
Another example:
inferno.dataset.get_len([[(1,2),(2,3)],[(4,5)], [(7,8)]])
expected: 3
actual: 2 (the length of the tuples)
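The reported behavior can be reproduced with a small, simplified sketch of get_len's internals. This is not the actual skorch code, just a mimic consistent with the helper names (_apply_to_data, _len, flatten) mentioned later in this thread and with the outputs reported above:

```python
def _apply_len(data):
    # Recurse into lists/tuples; when len() fails on the elements
    # (e.g. plain ints), fall back to the length of the container itself.
    if isinstance(data, (list, tuple)):
        try:
            return [_apply_len(x) for x in data]
        except TypeError:
            return len(data)
    return len(data)

def flatten(items):
    # Yield all leaves of an arbitrarily nested list
    for x in items:
        if isinstance(x, list):
            yield from flatten(x)
        else:
            yield x

def get_len_sketch(data):
    lens = set(flatten([_apply_len(data)]))
    if len(lens) != 1:
        raise ValueError("Dataset does not have consistent lengths.")
    return lens.pop()

# The recursion never records the outer length 3; it only sees tuple lengths.
print(get_len_sketch([[(1, 2), (2, 3)], [(4, 5)], [(7, 8)]]))  # → 2
# get_len_sketch([[(1, 2), (2, 3)], [(4, 5)], [(7, 8, 9)]]) raises ValueError
```

This makes the two results above unsurprising: the recursion bottoms out at the innermost tuples, so the set of collected lengths is {2} in the second example (hence 2) and {2, 3} in the first (hence the ValueError).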
A workaround is to convert the list into a numpy array.
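Concretely, the workaround looks like this (a sketch; with recent numpy versions, ragged nested data must be stored as an object array explicitly):

```python
import numpy as np

# Ragged nested list that trips up get_len
data = [[(1, 2), (2, 3)], [(4, 5)], [(7, 8, 9)]]

# Build a 1-D object array with one slot per sample; the ragged
# inner structure is kept as opaque Python objects
arr = np.empty(len(data), dtype=object)
arr[:] = data

print(len(arr))  # → 3, the expected number of samples
```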
This is also an issue with skorch (not inferno) when using an EmbeddingBag, which expects a variable-length tensor (a batch of n concatenated sequences) and a tensor containing n offsets. For now the only workaround is to concatenate the tensors inside forward(). The issue occurs when self.history.record_batch('train_batch_size', get_len(Xi)) is called within fit_loop of NeuralNet.
For example, when passing the return value of the following as data:
def datasets(df_x_train, df_x_test, series_y_train):
    x_train = df_x_train[['idxs', 'idx_lens']].values
    x_test = df_x_test[['idxs', 'idx_lens']].values
    y_train = series_y_train.cat.codes.values
    return x_train, x_test, y_train
and the following function, passed as iterator_train__collate_fn or iterator_valid__collate_fn, for generating batches:
def generate_batch(self, batch):
    idxs = [torch.tensor(o[0][0], dtype=torch.long, device=self.device) for o in batch]
    idxs = torch.cat(tensors=idxs, dim=0)
    offsets = torch.tensor(([0] + [o[0][1] for o in batch])[:-1]).cumsum(dim=0)
    x = {'idxs': idxs, 'offsets': offsets}
    y = torch.tensor([o[1] for o in batch], dtype=torch.long, device=self.device)
    return x, y
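The offsets line is the subtle part: EmbeddingBag needs the start index of every sequence in the flat, concatenated batch. In plain Python (no torch needed), the same prepend-then-cumulative-sum trick looks like this; start_offsets is a hypothetical helper name, not part of any library:

```python
def start_offsets(lengths):
    # Start index of each sequence once all sequences are concatenated:
    # prepend 0, drop the last length, and take the running sum.
    offsets, total = [], 0
    for n in lengths:
        offsets.append(total)
        total += n
    return offsets

# Three sequences of lengths 2, 1 and 3 concatenated into one flat batch
print(start_offsets([2, 1, 3]))  # → [0, 2, 3]
```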
This is also an issue with Skorch (not inferno)
The OP wrote "inferno" because that was skorch's original name ;)
Regarding the actual issue, I cannot quite infer what the data will look like. If you have any suggestion on how to change get_len to work correctly with the provided examples, that would be appreciated.
In these particular instances, the issue occurs when attempting to calculate the batch size. Since the size of the batches is determined by the batch size argument(s) given to the estimator, couldn't these values be used instead, thereby avoiding the need to call get_len at all?
Could you show a sample of Xi so that I can understand better?
Any progress on this front? The issue seems to be with the unpacking done by get_len.
@pgg1610 I would like to make some progress with this as well. Ideally, you could provide examples where the function fails or provides wrong numbers for you. With those, we could extend the test suite and then adjust the function to return the right results. Without examples, it's a bit difficult.
@BenjaminBossan Hello, the function seems to be failing for a nested dataset. In my case the dataset object is a numpy array of [X, y] wherein X itself is composed of 3 tensors which form the input to my nn.Module. I have a custom __len__ defined for the torch Dataset class that is used to build the [X, y], so as a hack I am calling that instead of skorch's get_len routine, which seems to unpack the dataset completely (_apply_to_data(data, _len, unpack_dict=True) followed by list(flatten(lens))). For better clarity, I could mail/share the main code and the changes I've made.
Yes, it would be nice if you could share some code. For that, I don't need the actual data, just something of the same structure, which could be used for a unit test. Thanks for helping.
Okay, let me see how I can provide that -- I am relatively new to GitHub forums, so please excuse my slow replies.
Take the time that you need.
On the question of how: what I like to do is code up a snippet using, for instance, a Jupyter notebook, make sure it shows the intended effect, clean up everything that's unnecessary, and then paste the code to GitHub.
I have uploaded the notebook with the error and the correction I had to do. get_len_bug.zip
Hi @pgg1610 I tried to make your example work but can't train the net since it's still missing some data (FileNotFoundError: [Errno 2] No such file or directory: '/Users/pghaneka/OneDrive - purdue.edu/WORK/Projects_data/MODULES/cgcnn/skorch_cgcnn/sample_regression/9011998.cif'). However, it seems I can inspect your training data. This is what I found out:
- your target looks reasonable, shouldn't cause any trouble
- your input data is a numpy array of length 9 -- I interpret this to mean that you have 9 samples
- each of the 9 elements is a tuple of length 3
- the 3rd element is a string, which is not well supported by PyTorch (but it looks like you could cast it to int) -- is this really needed?
- the 2nd element is a torch tensor of size 1 (basically just a scalar) and it has the same values as the target -- is this really needed?
- the 1st element is a tuple again, of length 3
- all of those 3 elements are torch tensors with possibly different shapes (but the first dimension, for each element, is identical)
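Data of the shape described above can be reproduced without the original .cif files. In this sketch, plain Python lists stand in for the torch tensors and all shapes are made up for illustration:

```python
import numpy as np

# Stand-in for the structure found in the notebook: 9 samples, each a
# tuple of ((tensor1, tensor2, tensor3), 1-element "tensor", string).
def make_sample(i):
    x = (
        [[0.0] * 4 for _ in range(3)],  # 1st "tensor", first dim = 3
        [[0.0] * 2 for _ in range(3)],  # 2nd "tensor", same first dim
        [[0.0] * 6 for _ in range(3)],  # 3rd "tensor", same first dim
    )
    return (x, [float(i)], str(i))      # ((t1, t2, t3), scalar-like, string)

data = np.empty(9, dtype=object)        # 1-D object array, one slot per sample
data[:] = [make_sample(i) for i in range(9)]

print(len(data), len(data[0]), len(data[0][0]))  # 9 samples, each a 3-tuple
```

A generic length function has no way to tell, from this structure alone, that 9 is the "right" length rather than 3.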
To be honest, I'm somewhat at a loss here and can't promise that skorch will ever provide a solution that will generically work with this kind of data. Maybe you can somehow simplify the data and make the problem go away?
If that is not possible, I would just code up my own custom Dataset class that wraps all of this. For this dataset to work with skorch, the only requirement is that the __getitem__ method returns exactly 2 elements (X and y). This should side-step all the get_len mess.
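A minimal sketch of such a wrapper (plain Python here; in practice you would subclass torch.utils.data.Dataset, and the class name is made up):

```python
class NestedDataset:
    """Hypothetical wrapper around samples of arbitrary nested structure."""

    def __init__(self, Xs, ys):
        assert len(Xs) == len(ys)
        self.Xs, self.ys = Xs, ys

    def __len__(self):
        # Unambiguous: one entry per sample, regardless of nesting inside X
        return len(self.Xs)

    def __getitem__(self, i):
        # Exactly two elements: X (may itself be a tuple of tensors) and y
        return self.Xs[i], self.ys[i]

# The ragged example from the top of the thread now reports length 3
ds = NestedDataset([[(1, 2), (2, 3)], [(4, 5)], [(7, 8, 9)]], [0, 1, 2])
print(len(ds))  # → 3
```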
If you can provide a patch for skorch that fixes your problem without breaking the existing tests, we could also accept that. But honestly I don't feel like I could provide such a patch myself.