polars
Request for Docs: torch
Hi @ritchie46, thank you for your amazing work here! Polars is great and I absolutely love working with it.
For my current challenge I'm working with a large-ish parquet file that should not live fully in memory, since the available memory will be used by other functionality. Below is some pseudo-code describing how I thought we could connect Polars to PyTorch. I was wondering if you could weigh in on whether there is a better way to do this with memory usage in mind. If I understand correctly, the example below would hold the entire dataset in memory. Would it be possible to do something like this with a LazyFrame instead?
Once again, thank you for the amazing project and your presence here. 🙇
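from pathlib import Path
from typing import List, Tuple

import polars as pl
from torch import Tensor
from torch.utils.data import Dataset
from torchvision.transforms import PILToTensor

class PolarsDataset(Dataset):
    def __init__(self, file_path: Path, transforms: List) -> None:
        # Eager read: as far as I understand, this holds the whole dataset in memory
        self.df = pl.read_parquet(file_path, use_pyarrow=True, memory_map=True)
        self.transform = PILToTensor()

    def __len__(self) -> int:
        return self.df.height

    def __getitem__(self, index: int) -> Tuple[Tensor, List[bool]]:
        # row() returns a tuple, so unpack the single column value
        (image_path,) = self.df.select("image_path").row(index)
        image = self.transform(get_pil_image(image_path))  # get_pil_image: helper not shown
        return image, list(self.df.select(["label1", "label2"]).row(index))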
I think you should write a generator that slices the file on the fly and caches and returns the rows you need.
@AlexanderVanEck: if you have found a solution in the meantime, I am sure there are people around here who would appreciate learning from it.
My hunch would be to use PyArrow's scanner (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html) to construct a batch reader, where the batch_size parameter controls the maximum number of rows read per batch. If you need a Polars DataFrame rather than an Arrow table in PyTorch, you could convert each batch to a Polars DataFrame using pl.from_arrow().
Btw, I am removing the feature label, because I think this is more a question asking for an example of how to do something than a request for specific functionality that is missing from polars. If you would actually like to see specific feature(s) added, feel free to post here and update the label.
@AlexanderVanEck Apologies for bumping this thread. Have you made any progress with this? I would be interested in it and would probably include it in a package I'm maintaining called PyTorch Tabular.