Use Pytorch Data and new DataPipes
We're currently using an untested copy of torchvision's download utilities, which is far from ideal. PyTorch Data separates that functionality into a separate package and introduces data pipes. It is still in a beta stage, but it's maturing quickly. Having seen this kind of functional interface in TensorFlow's data API, I think it looks very clean and is highly customizable. As the number of datasets in Tonic grows, we will benefit from the different data-loading functionalities in PyTorch Data, although it will require a considerable rewrite at the beginning.
Considering #167, and given the possibilities offered by DataPipes, I was thinking about letting the user apply transforms to the events externally, by making them aware of the datapipe structure, instead of embedding the transforms in the dataset. The transform would be mapped onto an IterDataPipe object through the map() method.
Here's some example code:
from tonic.prototype.datasets import nmnist
from tonic.prototype.transforms import ToFrame
dp = nmnist.nmnist(root="./data")
t = ToFrame(sensor_size=nmnist.sensor_size, time_window=10000)
# Mapping the transform to the DataPipe externally
dp = dp.map(t)
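Iterating the mapped datapipe would then yield already-transformed samples. A hypothetical usage sketch (the exact frame shape is an assumption):
# Each element coming out of the datapipe is now (frames, target) instead of (events, target).
frames, target = next(iter(dp))
print(frames.shape, target)  # e.g. (n_time_bins, 2, 34, 34) for NMNIST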
The following modifications are needed to our transforms. Right now, the dataset outputs an IterDataPipe object (or something like that) which, when wrapped in a Mapper, passes a tuple (events, target) to the specified function.
What I have done is modify this line as follows, passing the datapipe element instead of the events to the transform:
def __call__(self, dp):
    return functional.to_frame_numpy(
        dp=dp,  # The datapipe element is passed instead of the events.
        ...
and I replaced the functional to_frame with the following code: first I unpack the tuple, then I transform the events, and finally I return a tuple of frames and target.
def to_frame_numpy(
    dp,  # Datapipe element passed.
    ...
):
    ...
    # Unpacking the datapipe element.
    events, target = dp
    ...
    # Getting frames from events.
    frames = ...
    ...
    # Repacking into a tuple.
    return frames, target
Is this what you had in mind, @biphasic? It should be pretty straightforward to apply transforms externally thanks to the map() method of the datapipe. This way, the dataset would be extremely flexible and one could even define a custom transform as a simple function and map it to the datapipe, without creating a class and so on.
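For instance, a minimal sketch of such a function-based transform, assuming the usual structured event array with a "t" field (the time-reversal logic is just an illustration, not an existing Tonic transform):
def reverse_time(sample):
    events, target = sample
    # Flip the event order and mirror the timestamps so the recording plays backwards.
    events = events[::-1].copy()
    events["t"] = events["t"].max() - events["t"]
    return events, target

dp = dp.map(reverse_time)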
I'm not really sure yet, to be honest. I still don't understand what torchvision's new API is going to look like for datasets and transforms combined. I think it'll be good if we let those guys iterate over a few solutions until they find something that works. There is little documentation available at this point, but I also found this one https://github.com/pytorch/vision/blob/main/torchvision/prototype/features/_image.py and this one https://github.com/pytorch/vision/blob/main/torchvision/prototype/features/_feature.py, which seem to be actively worked on. It would be good if our transforms could also be chained with torchvision's, for example tonic.transforms.ToFrame(...) followed by torchvision.transforms.RandomCrop(...). I don't really know a good solution at this point :thinking:
I agree with you on waiting to see what they do with their transforms and then copying them :) I see that they are creating dataset classes from IterDataPipe, hence they do not intend to use the functional form of the dataset (?). We'll see.
I don't know about you @biphasic, but I completely missed this torchvision folder in which they have already implemented many datasets with data pipes :laughing:
good point, there is lots of stuff that they have already done and that I have yet to understand. Like this one too: https://github.com/pytorch/vision/blob/main/torchvision/prototype/datasets/_api.py
I'll try to summarize here what I understood from studying the torchvision code. Their README carefully explains how one should implement a datapiped dataset, so I used it as inspiration for us.
Their dataset class is the template for implementing a generic dataset. Fundamentally, one needs to implement a Dataset class with three methods (a rough skeleton putting them together is sketched after the three points below).
A _datapipe() method that describes how a sample is extracted from the file. For us, in NMNIST for instance, it is something like this:
def _datapipe(self) -> IterDataPipe[Sample]:
    filename = self._TRAIN_FILENAME if self.train else self._TEST_FILENAME
    filepath = os.path.join(self._root, filename)
    dp = FileLister(str(filepath))
    dp = FileOpener(dp, mode="b")
    # Unzipping.
    dp = ZipArchiveLoader(dp)
    # Filtering the non-bin files.
    dp = Filter(dp, self._filter, input_col=0)
    # Reading data to structured NumPy array and integer target.
    dp = NMNISTFileReader(dp)
    # Filtering the first saccade.
    if self.first_saccade_only:
        dp = Mapper(dp, self._saccade_filter, input_col=0)
    # Applying transforms.
    if self.transforms:
        dp = Mapper(dp, self.transforms)
    else:
        if self.transform:
            dp = Mapper(dp, self.transform, input_col=0, output_col=0)
        if self.target_transform:
            dp = Mapper(dp, self.target_transform, input_col=1, output_col=1)
    return dp
An NMNISTFileReader class is responsible for transforming the binary stream from the ZIP archive into a tuple (events, target). Actually, torchvision requires this to be a dictionary such as {'frames': torch.Tensor, 'target': Any}, which I originally implemented but then discarded, since our transforms would not work on a dictionary and the events could be transformed to frames or something else; hence, a key like "events" would have made no sense.
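A minimal sketch of what such a reader could look like, assuming the label is encoded in the parent folder name and leaving the binary decoding abstract (names and details are assumptions, not the actual implementation):
import os
import numpy as np
from torchdata.datapipes.iter import IterDataPipe

class NMNISTFileReader(IterDataPipe):
    def __init__(self, dp: IterDataPipe) -> None:
        self.dp = dp

    def __iter__(self):
        for fname, fdata in self.dp:
            # The class label is assumed to be the parent folder name, e.g. ".../Train/3/00042.bin".
            target = int(fname.split(os.sep)[-2])
            # Decode the raw bytes into a structured (x, y, t, p) event array.
            events = self._decode(fdata.read())
            yield events, target

    def _decode(self, raw: bytes) -> np.ndarray:
        ...  # binary parsing of the .bin event format would go here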
A __len__() method which simply returns the number of samples in the dataset. For NMNIST:
def __len__(self) -> int:
    return 60_000 if self.train else 10_000
Then, they require a resource() method. If one looks at their code, it is a bunch of classes that you use to tell the datapipe whether the dataset has to be downloaded from the internet (OnlineResource, HttpResource, GDriveResource) or whether the user has to download it manually (ManualDownloadResource, KaggleDownloadResource). It seemed a little overkill for us to implement this now, hence I only used the first two methods.
I hope that this clarifies the situation a little bit :laughing: I also hope that I understood correctly what they are doing.
keeping track of supported datasets here: https://github.com/neuromorphs/tonic/tree/develop/tonic/prototype#currently-supported-datasets
@biphasic should I try to add some testing routines for the prototypes?
that is a very good next step. If we are able to test the prototype datasets, we can start including them in the package. In general it would be good to aim for high test coverage (that is, the fraction of lines covered by tests) for the new datasets. Our coverage for the current versions can be seen here: https://app.codecov.io/gh/neuromorphs/tonic/tree/develop/tonic/datasets
Ideally we rely on artificial data injection/creation to test that everything works, as in we create some artificial samples for the tests. But it's not always clear how the custom files are created, so for some datasets I resorted to downloading single samples from a server that I had access to, like here: https://github.com/neuromorphs/tonic/blob/develop/test/test_datasets.py#L116
This server is not going to be available forever so where possible we should not rely on it.
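As a rough idea, an artificial-data test could look something like this (the event dtype, sensor size and helper names are assumptions for illustration, not the actual test suite):
import numpy as np

def create_fake_events(n: int = 100, sensor_size=(34, 34, 2)) -> np.ndarray:
    """Create a structured array that mimics a recorded (x, y, t, p) event stream."""
    events = np.zeros(n, dtype=[("x", "<i8"), ("y", "<i8"), ("t", "<i8"), ("p", "<i8")])
    events["x"] = np.random.randint(0, sensor_size[0], n)
    events["y"] = np.random.randint(0, sensor_size[1], n)
    events["t"] = np.sort(np.random.randint(0, 300_000, n))
    events["p"] = np.random.randint(0, sensor_size[2], n)
    return events

def test_prototype_dataset_yields_samples(tmp_path):
    # Sketch: 1) encode the fake events in the dataset's on-disk format under tmp_path,
    #         2) instantiate the prototype dataset with root=tmp_path,
    #         3) check that iterating it yields (events, target) pairs with the expected fields.
    events = create_fake_events()
    assert events.dtype.names == ("x", "y", "t", "p")
    assert np.all(np.diff(events["t"]) >= 0)  # timestamps are sorted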
OK, I will try to come up with something.
In the meantime, I tried out their web datapipes and they seem pretty nice.
Basically, you set up a datapipe chain HttpReader -> Mapper -> Saver and you can download anything you want. They have a GDriveReader for reading content from Google Drive. The code is the following:
import os
from pathlib import Path
from typing import Union

# Imports below are assumed to come from torchdata's iterable datapipes.
from torchdata.datapipes.iter import HttpReader, IterableWrapper, Mapper, Saver

def download_url(url: str, root: Union[Path, str], filename: str) -> None:
    # Reading the data from the web.
    download_dp = HttpReader(IterableWrapper([url]))

    # Extracting the data with a function.
    def get_http_data(http_obj):
        return http_obj.data

    download_dp = Mapper(download_dp, get_http_data, input_col=1)

    # Saving to disk.
    def filepath_fn(fpath):
        return os.path.join(root, filename)

    download_dp = Saver(download_dp, mode="wb", filepath_fn=filepath_fn)
    _ = next(iter(download_dp))
And you can use it in the following way.
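For example, with a placeholder URL (not a real download link):
download_url(
    url="https://example.com/NMNIST_Train.zip",  # placeholder URL
    root="./data",
    filename="NMNIST_Train.zip",
)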
What do you think @biphasic? Would it be worth using this as an experimental download routine?
Do you mean to use this as a drop-in replacement for the current tonic.download_utils.download_url? I think this would work, although some datasets call it with the md5 argument as well, so that would need to be included (see the function signature here: https://github.com/neuromorphs/tonic/blob/develop/tonic/download_utils.py#L90-L96)
Overall though, I'm not sure whether this is worth the work! Because in the end we want to delete the whole download_utils anyway, right? I think the next step would be to figure out how to write a test case for the prototype datasets. For Tonic 2.0 with the new version of the datasets, we really want everything tested, in contrast to now, where some datasets are not really tested and we have to rely on users telling us when things break. If for the moment we use download_utils.download_url in the tests to download some example event files, I'm not too worried about that, because eventually we'll also get rid of any downloading during the tests; it's just much more reliable that way. Please let me know if you have preferences for what you'd like to tackle, then we can certainly work something out and I'll take on some of the things. Generally I see the following way forward, please complement where you see fit:
- Adding test cases for prototype datasets: if it helps for the prototype folder to show up on codecov.io, I'm happy to include the prototype folder in the package using __init__ files.
- Updating datasets: some datasets could benefit from the GDriveReader, such as MVSEC, where hdf5 files are only available on Google Drive; that has been complicated so far with the old API, so hopefully we'll be able to make use of some of PyTorch Data's new features.
- Thanks to your design we'll be able to use the transforms in their current form, so let's do that for the moment. If there is a road block from not switching to a new API for transforms, then we tackle it later.
- Eventually delete download_utils and include torchdata as a dependency.
I would have left it in the prototype/utils folder as a temporary replacement, also given the fact that torchvision is switching from MD5 to SHA256 checks (in fact, a check_sha256 function is used in both NMNIST and STMNIST).
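For reference, such a check can be very small; a minimal hashlib-based sketch (an assumption, not necessarily the exact implementation in the prototype folder):
import hashlib

def check_sha256(fpath: str, sha256_provided: str) -> None:
    with open(fpath, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != sha256_provided:
        raise RuntimeError(f"SHA256 checksum mismatch for {fpath}.")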
I agree with you that a proper test routine is needed and I will work on it as a priority, at least for NMNIST and STMNIST.
As for the Google Drive download, I think we can use the experimental API based on datapipes.
Concerning transforms, in torchvision they are using a new format for the output of the dataset: no longer a tuple but a dictionary. I did not like the idea because it would mess up our transforms. We'll see in the future what they will do. They are also changing the API for some reasons that I did not understand, but I would stick with what we are doing now.
Edit: moreover, when it is not possible to generate fake data on the fly, could we keep some single-sample binary files inside the repository? They should not take up much space, right?
Sounds good 👍 thank you
closing this for the moment, as we're already working on it.