[Question / Feature Request] PyTorch dataset abstraction
All of the documentation I have sifted through essentially involves re-saving data whenever the format changes, but is there a way to use this library without that? A good use case is when you have a very large amount of data: you read the data in a supported format, some conversion happens on the fly, and then you pass it into your pipeline. It would add to the cost of data loading, but that can be worth it if it saves terabytes of disk space.
I am loading a dataset like this:
```python
from datumaro.components.dataset import Dataset

dataset = Dataset.import_from("./data", "yolo")
print(dataset)
```

```
Dataset
...
subsets
    test: # of items=...
    train: # of items=...
    val: # of items=...
infos
...
```
Where I would want to implement a PyTorch Lightning data module like:
```python
import lightning as L
from torch.utils.data import DataLoader

from datumaro.components.dataset import Dataset


class MyDataModule(L.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage: str):
        # Being able to load only specific subsets would be nice here too,
        # but that sounds like a large undertaking
        dataset = Dataset.import_from("./data", "yolo")
        if stage == "fit":
            self.dataset_train = dataset.get_subset("train")
            self.dataset_val = dataset.get_subset("val")
        if stage == "test":
            self.dataset_test = dataset.get_subset("test")

    def train_dataloader(self):
        return DataLoader(self.dataset_train, batch_size=self.batch_size)

    # and so on for "test" and "val"
```
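In the meantime, a thin map-style wrapper might bridge the gap between a Datumaro subset and what `DataLoader` expects. This is only a sketch: it assumes the subset object is iterable, and `transform` is a hypothetical callable of my own (it would convert a Datumaro item into tensors in practice).

```python
class SubsetAsTorchDataset:
    """Map-style wrapper: materializes item references from any iterable
    source so that __getitem__ and __len__ work, which is all DataLoader
    needs from a map-style dataset. Sketch only; 'transform' is a
    hypothetical hook, and the source being iterable is an assumption.
    """

    def __init__(self, subset, transform=None):
        # Store item references only; if the source loads media lazily,
        # images are still read on access, not here.
        self._items = list(subset)
        self._transform = transform

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        item = self._items[idx]
        return self._transform(item) if self._transform else item
```

With a wrapper like this, `DataLoader(SubsetAsTorchDataset(dataset.get_subset("train"), transform=...), batch_size=32)` would work without re-saving anything to disk, at the cost of one up-front iteration to collect the item references.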
Is this possible? Neither of the approaches I have tried works:

```python
train = dataset.get_subset("train")
print(train.__getitem__(0))
```

```
AttributeError: 'DatasetSubset' object has no attribute '__getitem__'
```
Or even an attempt to make a wrapper:
```python
train = dataset.get_subset("train")
print(train.get(0))
```

```
---> [96] assert (subset or DEFAULT_SUBSET_NAME) == (self.name or DEFAULT_SUBSET_NAME)
AssertionError:
```
The problem I am running into is that the subsets cannot be separated from the main dataset and are not treated as datasets in their own right. (It also looks like `get()` expects an item id and subset name rather than an integer index, which would explain the assertion error.) Could I be doing anything differently?
This is the main thing stopping me from using this really useful library in my pipeline. I can really see its potential, but it doesn't offer the specific data-loading features I am looking for (which might be by design). If anyone knows of a good method or tool for this, I would love to hear! Thank you 😄
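For what it's worth, the splitting itself is cheap to do by hand if the dataset can be iterated once: group item references by subset name into standalone sequences. A sketch, assuming each item exposes its subset name (with Datumaro that would be something like `item.subset`, but `subset_of` here is a hypothetical extractor of my own):

```python
from collections import defaultdict


def split_by_subset(items, subset_of):
    """Group an iterable of items into standalone per-subset lists.

    'subset_of' extracts the subset name from an item; it is a
    hypothetical hook (for Datumaro it could be `lambda it: it.subset`).
    One pass over the items, no re-saving to disk.
    """
    groups = defaultdict(list)
    for item in items:
        groups[subset_of(item)].append(item)
    return dict(groups)
```

Each resulting list can then be wrapped independently (e.g. fed to a map-style dataset), which sidesteps the subset/parent coupling entirely.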