
[Feature Request] Array of dicts of tensors structure

Open vadimkantorov opened this issue 2 years ago • 3 comments

Storing arrays of dicts of tensors in a "columnar format" can be more compact in some circumstances (e.g. it is copy-on-write safe in a multiprocessing context, since the data is held in a very small number of tensors whose count does not depend on the dataset size): https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57
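A minimal sketch of the columnar idea (not the gist verbatim; to_columnar and get_item are hypothetical helper names): a list of dicts of variable-length tensors is stored as one concatenated tensor per key plus offsets, so the number of tensor objects stays constant no matter how many items the dataset holds.

import torch

def to_columnar(items):
    # items: list of dicts mapping key -> 1D tensor (possibly of varying length)
    columnar = {}
    for k in items[0].keys():
        parts = [it[k] for it in items]
        lengths = torch.tensor([len(p) for p in parts])
        columnar[k] = {
            "data": torch.cat(parts),                       # one big tensor per key
            "offsets": torch.cumsum(lengths, 0) - lengths,  # start index of each item
            "lengths": lengths,
        }
    return columnar

def get_item(columnar, i):
    # Rebuild the i-th dict by slicing each key's concatenated tensor.
    out = {}
    for k, col in columnar.items():
        start = col["offsets"][i].item()
        out[k] = col["data"][start:start + col["lengths"][i].item()]
    return out

items = [{"tokens": torch.arange(3)}, {"tokens": torch.arange(5)}]
cols = to_columnar(items)
print(get_item(cols, 1))  # {'tokens': tensor([0, 1, 2, 3, 4])}

Because workers only ever see the few concatenated tensors, iterating over items never touches per-item Python objects whose refcounts would trigger copy-on-write page copies.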

vadimkantorov avatar Nov 07 '22 19:11 vadimkantorov

Thanks for this @vadimkantorov! How do you see this interacting with TensorDict? Should arrays of dicts be a possible data type stored by TensorDict? Do you have a typical use case in mind?

vmoens avatar Nov 08 '22 12:11 vmoens

I don't know much about the TensorDict project. I just wanted to share a use case I had for dicts of tensors: representing a dataset in a way that avoids copy-on-write problems: https://github.com/pytorch/pytorch/issues/13246

I represented this array of dicts of tensors as a columnar dict of tensors, where each key maps to a tensor that concatenates all the per-item tensors for that key.

One way it could integrate with TensorDict: provide a constructor/util function and an indexing/getitem method (or util) that slices all keys of the TensorDict and returns a new, per-item TensorDict. These could be just recipes in the docs, or util functions plus tests verifying that no copy-on-write/memory expansion actually happens and that such a structure can be shared safely across multiprocessing/dataloading without any copies.
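A hedged sketch of what such a util could look like (from_dicts is a hypothetical helper, not existing tensordict API): build one TensorDict whose leading batch dimension indexes items, then index it to get a per-item TensorDict.

import torch
from tensordict import TensorDict

def from_dicts(dicts):
    # Assumes every dict has the same keys and matching per-key tensor shapes.
    stacked = {k: torch.stack([d[k] for d in dicts], 0) for k in dicts[0].keys()}
    return TensorDict(stacked, batch_size=[len(dicts)])

dataset = from_dicts([
    {"image": torch.zeros(3, 4, 4), "label": torch.tensor(0)},
    {"image": torch.ones(3, 4, 4), "label": torch.tensor(1)},
])
item = dataset[1]       # per-item TensorDict sliced along the batch dimension
print(item["label"])    # tensor(1)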

A similar use case is collecting partial results from a validation loop. Usually one would store them in a list of dicts of tensors and then analyze them somehow. If such a structure were implemented in an extensible way (as proposed here: https://github.com/pytorch/pytorch/issues/64359), it could be useful.
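A rough illustration of that validation-loop pattern with plain dicts (no tensordict involved): collect a dict of tensors per step, then fold the list into a single dict of stacked tensors for analysis.

import torch

results = []
for step in range(3):                      # stand-in for a validation loop
    results.append({
        "loss": torch.tensor(float(step)),
        "correct": torch.tensor(step % 2),
    })

# "Columnar" view of the collected results: one tensor per key.
summary = {k: torch.stack([r[k] for r in results]) for k in results[0]}
print(summary["loss"].mean())              # tensor(1.)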

vadimkantorov avatar Nov 08 '22 12:11 vadimkantorov

> A similar use case is collecting partial results from a validation loop. Usually one would store them in a list of dicts of tensors and then analyze them somehow. If such a structure were implemented in an extensible way (as proposed here: https://github.com/pytorch/pytorch/issues/64359), it could be useful.

That is something we already have, I think.

Here's an example:

>>> import torch
>>> from tensordict import TensorDict
>>> tensordict1 = TensorDict({"a": torch.zeros(1, 1)}, [1])
>>> tensordict2 = TensorDict({"a": torch.ones(1, 1)}, [1])
>>> tensordict = torch.stack([tensordict1, tensordict2], 0)
>>> 
>>> tensordict
LazyStackedTensorDict(
    fields={
        a: Tensor(torch.Size([2, 1, 1]), dtype=torch.float32)},
    batch_size=torch.Size([2, 1]),
    device=None,
    is_shared=False)
>>>
>>> tensordict[0] is tensordict1
True
>>> tensordict["a"]
tensor([[[0.]],

        [[1.]]])

The LazyStackedTensorDict does not currently support appending, but we might consider adding that.
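In the meantime, one hedged workaround (a sketch, not an official recipe) is to keep a plain Python list of TensorDicts and re-stack whenever a batched view is needed:

import torch
from tensordict import TensorDict

collected = []
for i in range(4):
    collected.append(TensorDict({"a": torch.full((1, 1), float(i))}, [1]))

stacked = torch.stack(collected, 0)   # LazyStackedTensorDict over all items
print(stacked["a"].shape)             # torch.Size([4, 1, 1])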

vmoens avatar Nov 08 '22 16:11 vmoens