
get_len has issues with nested lists

ottonemo opened this issue 7 years ago • 12 comments

For example:

inferno.dataset.get_len([[(1,2),(2,3)],[(4,5)], [(7,8,9)]])

expected: 3
actual: ValueError: Dataset does not have consistent lengths.

Another example:

inferno.dataset.get_len([[(1,2),(2,3)],[(4,5)], [(7,8)]])

expected: 3
actual: 2 (the length of the tuples)

A workaround is to convert the list into a numpy array.
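The workaround can be sketched as follows (a minimal example, assuming the ragged list from above; the object-dtype array is used so numpy keeps the outer length instead of trying to build a rectangular array):

```python
import numpy as np

# The nested list that trips up get_len:
data = [[(1, 2), (2, 3)], [(4, 5)], [(7, 8, 9)]]

# Wrap it in an object-dtype array so the outer length is unambiguous.
arr = np.empty(len(data), dtype=object)
arr[:] = data

print(len(arr))  # -> 3, the expected number of samples
```

With the plain list, length inference has to guess which nesting level counts as "samples"; the array makes that explicit.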

ottonemo avatar Aug 11 '17 09:08 ottonemo

This is also an issue with Skorch (not inferno) when using an EmbeddingBag, which expects a variable-length tensor (a batch of n concatenated sequences) and a tensor containing n offsets. For now the only workaround is to concatenate the tensors inside forward(). The issue occurs when self.history.record_batch('train_batch_size', get_len(Xi)) is called within fit_loop of NeuralNet.

For example when passing the return from the following as data:

def datasets(df_x_train, df_x_test, series_y_train):
    x_train = df_x_train[['idxs', 'idx_lens']].values
    x_test = df_x_test[['idxs', 'idx_lens']].values
    y_train = series_y_train.cat.codes.values
    return x_train, x_test, y_train

and the following for generating batches to arg "iterator_train__collate_fn" or "iterator_valid__collate_fn"

def generate_batch(self, batch):
    # concatenate the index sequences of all samples into one flat tensor
    idxs = [torch.tensor(o[0][0], dtype=torch.long, device=self.device) for o in batch]
    idxs = torch.cat(tensors=idxs, dim=0)
    # offsets mark where each sample's sequence starts in the flat tensor
    offsets = torch.tensor(([0] + [o[0][1] for o in batch])[:-1]).cumsum(dim=0)
    x = {'idxs': idxs, 'offsets': offsets}
    y = torch.tensor([o[1] for o in batch], dtype=torch.long, device=self.device)
    return x, y
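For context, the calling convention this collate function produces can be sketched with toy numbers (invented sizes and indices, not the actual model from this thread):

```python
import torch
import torch.nn as nn

# EmbeddingBag consumes one flat tensor of concatenated indices plus a
# tensor of offsets marking where each sequence in the batch starts.
emb = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode='mean')

# A batch of 3 sequences with lengths 2, 1, 3, concatenated:
idxs = torch.tensor([1, 2, 5, 7, 8, 9], dtype=torch.long)
offsets = torch.tensor([0, 2, 3], dtype=torch.long)  # cumsum of [0, 2, 1]

out = emb(idxs, offsets)
print(out.shape)  # torch.Size([3, 4]) -- one pooled embedding per sequence
```

Note that the "batch size" here is the length of offsets (3), not the length of idxs (6), which is exactly the ambiguity that confuses get_len.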

gregorywalsh avatar Aug 28 '19 19:08 gregorywalsh

This is also an issue with Skorch (not inferno)

The OP wrote "inferno" because that was the first name of skorch ;)

Regarding the actual issue, I cannot quite infer what the data will look like. If you have any suggestion on how to change get_len to work correctly with the provided examples, that would be appreciated.

BenjaminBossan avatar Aug 30 '19 20:08 BenjaminBossan

In these particular instances, the issue occurs when attempting to calculate the batch size. Since the size of the batches is determined by the batch size argument(s) given to the estimator, couldn't these values be used instead, thereby avoiding the need to call get_len at all?

gregorywalsh avatar Aug 30 '19 20:08 gregorywalsh

Could you show a sample of Xi so that I can understand better?

BenjaminBossan avatar Aug 31 '19 08:08 BenjaminBossan

Any progress on this front? The issue seems to be with the unpacking in get_len.

pgg1610 avatar Mar 30 '20 20:03 pgg1610

@pgg1610 I would like to make some progress with this as well. Ideally, you could provide examples where the function fails or provides wrong numbers for you. With those, we could extend the test suite and then adjust the function to return the right results. Without examples, it's a bit difficult.

BenjaminBossan avatar Mar 31 '20 19:03 BenjaminBossan

@BenjaminBossan hello, the function seems to be failing for a nested dataset. In my case the dataset object is a numpy array of [X, y], where X itself is composed of 3 tensors that form the input to my nn.Module. I have a custom __len__ defined for the torch Dataset class used to build [X, y], so as a hack I am calling that instead of skorch's get_len routine, which seems to unpack the dataset completely [_apply_to_data(data, _len, unpack_dict=True) and list(flatten(lens))]. For better clarity, I could mail/share the main code and the changes I've made.

pgg1610 avatar Apr 03 '20 03:04 pgg1610

Yes, it would be nice if you could share some code. For that, I don't need the actual data, just something of the same structure, which could be used for a unit test. Thanks for helping.

BenjaminBossan avatar Apr 04 '20 14:04 BenjaminBossan

Okay, let me see how I can provide that -- I am relatively new to GitHub forums, so please excuse my slow replies.

pgg1610 avatar Apr 08 '20 01:04 pgg1610

Take the time that you need.

On the question of how: what I like to do is code up a snippet in, for instance, a Jupyter notebook, make sure it shows the intended effect, clean up everything unnecessary, and then paste the code to GitHub.

BenjaminBossan avatar Apr 08 '20 19:04 BenjaminBossan

I have uploaded the notebook with the error and the correction I had to do. get_len_bug.zip

pgg1610 avatar Apr 12 '20 17:04 pgg1610

Hi @pgg1610 I tried to make your example work but can't train the net since it's still missing some data (FileNotFoundError: [Errno 2] No such file or directory: '/Users/pghaneka/OneDrive - purdue.edu/WORK/Projects_data/MODULES/cgcnn/skorch_cgcnn/sample_regression/9011998.cif'). However, it seems I can inspect your training data. This is what I found out:

  • your target looks reasonable, shouldn't cause any trouble
  • your input data is a numpy array of length 9 -- I interpret this to mean that you have 9 samples
  • each of the 9 elements is a tuple of length 3
  • the 3rd element is a string, which is not well supported by PyTorch (but it looks like you could cast it to int) -- is this really needed?
  • the 2nd element is a torch tensor of size 1 (basically just a scalar) and it has the same values as the target -- is this really needed?
  • the 1st element is a tuple again, of length 3
  • all of those 3 elements are torch tensors with possibly different shapes (but the first dimension, for each element, is identical)
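A synthetic stand-in for data of that shape (invented values, just mirroring the structure described above, for use in a unit test) might look like:

```python
import numpy as np
import torch

def make_sample():
    # tuple of 3 tensors with different shapes but an identical first dimension
    inner = (torch.zeros(4, 2), torch.zeros(4, 3), torch.zeros(4))
    # (nested X, scalar-like tensor, string that could be cast to int)
    return (inner, torch.tensor([0.5]), '123')

# 9 samples, stored in an object array so the ragged structure survives
X = np.empty(9, dtype=object)
for i in range(9):
    X[i] = make_sample()

print(len(X))  # -> 9, the number of samples
```

Length inference has to stop at the outer object array here; descending into the tuples yields conflicting lengths (3 vs. 4 vs. 1).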

To be honest, I'm somewhat at a loss here and can't promise that skorch will ever provide a solution that will generically work with this kind of data. Maybe you can somehow simplify the data and make the problem go away?

If that is not possible, I would just code up my own custom Dataset class that wraps all of this. For this dataset to work with skorch, the only requirement is that the __getitem__ method returns exactly 2 elements (X and y). This should side-step all the get_len mess.
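A minimal sketch of such a wrapper (names invented, assuming the underlying data is indexable) could look like:

```python
from torch.utils.data import Dataset

class WrappedDataset(Dataset):
    """Hypothetical wrapper: hides the nested structure from skorch by
    returning exactly two elements (X, y) per sample."""

    def __init__(self, X, y):
        assert len(X) == len(y)
        self.X = X
        self.y = y

    def __len__(self):
        # explicit length, so skorch never has to infer it via get_len
        return len(self.X)

    def __getitem__(self, i):
        return self.X[i], self.y[i]

ds = WrappedDataset(list('abcdefghi'), list(range(9)))
print(len(ds))  # -> 9
print(ds[0])   # -> ('a', 0)
```

skorch only needs __len__ and a 2-tuple from __getitem__, so however nested X[i] is internally, the length question never arises.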

If you can provide a patch for skorch that fixes your problem without breaking the existing tests, we could also accept that. But honestly I don't feel like I could provide such a patch myself.

BenjaminBossan avatar Apr 13 '20 15:04 BenjaminBossan