[Question / Feature Request] PyTorch dataset abstraction
All of the documentation I have sifted through essentially involves re-saving data whenever the format changes, but is there a way to use this library without that? A good use case is when you have a very large amount of data: you read the data in a supported format, some conversion happens on the fly, and then you pass it into your pipeline. It would add to the cost of data loading, but that can be worth it if it saves terabytes of disk space.
I am loading a dataset like this:
```python
from datumaro.components.dataset import Dataset

dataset = Dataset.import_from("./data", "yolo")
print(dataset)
```

```
Dataset
...
subsets
    test: # of items=...
    train: # of items=...
    val: # of items=...
infos
...
```
Where I would want to implement a PyTorch Lightning data module like:
```python
import lightning as L
from torch.utils.data import DataLoader

from datumaro.components.dataset import Dataset


class MyDataModule(L.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage: str):
        # Being able to load only specific subsets would be nice here too,
        # but that sounds like a large undertaking
        dataset = Dataset.import_from("./data", "yolo")
        if stage == "fit":
            self.dataset_train = dataset.get_subset("train")
            self.dataset_val = dataset.get_subset("val")
        if stage == "test":
            self.dataset_test = dataset.get_subset("test")

    def train_dataloader(self):
        return DataLoader(self.dataset_train, batch_size=self.batch_size)

    # and so on for "test" and "val"
```
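In the meantime, a thin map-style wrapper might bridge the gap between a Datumaro subset and what `DataLoader` expects. This is only a sketch: it assumes the subset object is iterable, and `transform` is a hypothetical callable of my own (it would convert a Datumaro item into tensors in practice).

```python
class SubsetAsTorchDataset:
    """Map-style wrapper: materializes item references from any iterable
    source so that __getitem__ and __len__ work, which is all DataLoader
    needs from a map-style dataset. Sketch only; 'transform' is a
    hypothetical hook, and the source being iterable is an assumption.
    """

    def __init__(self, subset, transform=None):
        # Store item references only; if the source loads media lazily,
        # images are still read on access, not here.
        self._items = list(subset)
        self._transform = transform

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        item = self._items[idx]
        return self._transform(item) if self._transform else item
```

With a wrapper like this, `DataLoader(SubsetAsTorchDataset(dataset.get_subset("train"), transform=...), batch_size=32)` would work without re-saving anything to disk, at the cost of one up-front iteration to collect the item references.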
Is this possible? Neither of the approaches I have tried works:

```python
train = dataset.get_subset("train")
print(train.__getitem__(0))
```

```
AttributeError: 'DatasetSubset' object has no attribute '__getitem__'
```
Or even an attempt to make a wrapper:
```python
train = dataset.get_subset("train")
print(train.get(0))
```

```
---> [96] assert (subset or DEFAULT_SUBSET_NAME) == (self.name or DEFAULT_SUBSET_NAME)
AssertionError:
```
The problem I am running into is that the subsets cannot be separated from the main dataset and are not treated as datasets in their own right. (It also looks like `get()` expects an item id and subset name rather than an integer index, which would explain the assertion error.) Could I be doing anything differently?
This is the main thing stopping me from using this really useful library in my pipeline. I can really see its potential, but it doesn't offer the specific data-loading features I am looking for (which might be by design). If anyone knows of a good method or tool for this, I would love to hear! Thank you 😄
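For what it's worth, the splitting itself is cheap to do by hand if the dataset can be iterated once: group item references by subset name into standalone sequences. A sketch, assuming each item exposes its subset name (with Datumaro that would be something like `item.subset`, but `subset_of` here is a hypothetical extractor of my own):

```python
from collections import defaultdict


def split_by_subset(items, subset_of):
    """Group an iterable of items into standalone per-subset lists.

    'subset_of' extracts the subset name from an item; it is a
    hypothetical hook (for Datumaro it could be `lambda it: it.subset`).
    One pass over the items, no re-saving to disk.
    """
    groups = defaultdict(list)
    for item in items:
        groups[subset_of(item)].append(item)
    return dict(groups)
```

Each resulting list can then be wrapped independently (e.g. fed to a map-style dataset), which sidesteps the subset/parent coupling entirely.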