Request for Docs: torch

Open alexandervaneck opened this issue 3 years ago • 5 comments

Hi @ritchie46, thank you for your amazing work here! Polars is great and I absolutely love working with it.

For my current challenge I'm working with a large-ish parquet file that should not live fully in memory, since the available memory will be used by other functionality. Below is some pseudo-code describing how I thought we could connect Polars to PyTorch. I was wondering if you could weigh in on whether there is a better way to do this, considering memory usage. If I understand correctly, the example below would hold the entire dataset in memory. Would it be possible to do something like this with a LazyFrame instead?

Once again, thank you for the amazing project and your presence here. 🙇

from pathlib import Path
from typing import Tuple

import polars as pl
from torch import Tensor
from torch.utils.data import Dataset
from torchvision.transforms import PILToTensor


class PolarsDataset(Dataset):
    def __init__(self, file_path: Path) -> None:
        # NOTE: read_parquet loads the whole file into memory
        self.df = pl.read_parquet(file_path, use_pyarrow=True, memory_map=True)
        self.transform = PILToTensor()

    def __len__(self) -> int:
        return self.df.height

    def __getitem__(self, index: int) -> Tuple[Tensor, bool, bool]:
        (image_path,) = self.df[['image_path']].row(index)
        # get_pil_image is a user-defined helper that loads a PIL image from disk
        image = self.transform(get_pil_image(image_path))
        return image, *self.df[['label1', 'label2']].row(index)

alexandervaneck · Jun 13 '22 08:06

I think you should write a generator that slices the file on the fly and caches and returns the rows you need.
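For illustration, a minimal sketch of that idea, assuming a map-style Dataset over a single parquet file scanned lazily; the slice size, the cache-one-slice-at-a-time strategy, and the column names (taken from the pseudo-code above) are assumptions, not a prescribed API:

```python
import polars as pl
from torch.utils.data import Dataset


class SlicedParquetDataset(Dataset):
    """Sketch: materialize one slice of the file at a time and cache it."""

    def __init__(self, file_path, slice_size: int = 1024) -> None:
        self.lf = pl.scan_parquet(file_path)  # lazy, nothing is read yet
        self.slice_size = slice_size
        # count rows without loading the data (pl.count() on older Polars versions)
        self.n_rows = self.lf.select(pl.len()).collect().item()
        self._cached_offset = None
        self._cached_slice = None

    def __len__(self) -> int:
        return self.n_rows

    def __getitem__(self, index: int):
        offset = (index // self.slice_size) * self.slice_size
        if offset != self._cached_offset:
            # read just the slice that contains this row, then keep it around
            self._cached_slice = self.lf.slice(offset, self.slice_size).collect()
            self._cached_offset = offset
        # 'label1' and 'label2' are the example columns from the pseudo-code above
        return self._cached_slice[['label1', 'label2']].row(index - offset)
```

Sequential access (e.g. a DataLoader without shuffling) keeps the cache effective; fully random access would re-collect a slice for almost every row.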

ritchie46 · Jun 14 '22 06:06

@AlexanderVanEck: if you have found a solution in the meantime, I am sure there are people around here who would appreciate seeing it so they can learn from it.

My hunch would be to use PyArrow's scanner (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html) to construct a batch reader, where the batch_size parameter controls the number of rows read at a time. If you need a Polars DataFrame rather than an Arrow Table in PyTorch, you can convert a batch to a Polars DataFrame using pl.from_arrow().
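A hedged sketch of that approach; the function name, file path, and batch size are placeholders:

```python
import polars as pl
import pyarrow.dataset as ds


def iter_polars_batches(file_path: str, batch_size: int = 4096):
    """Yield Polars DataFrames of at most `batch_size` rows from a parquet file."""
    dataset = ds.dataset(file_path, format="parquet")
    # the scanner streams record batches instead of loading the whole file
    for record_batch in dataset.scanner(batch_size=batch_size).to_batches():
        yield pl.from_arrow(record_batch)
```

A torch IterableDataset could wrap this generator so a DataLoader consumes the batches without ever holding the full file in memory.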

zundertj · Jul 17 '22 16:07

Btw, I am removing the feature label, because I think this is a question asking for an example of how to do something rather than a request for specific missing functionality in polars. If you would actually like to see specific feature(s) added, feel free to post here and update the label.

zundertj · Jul 17 '22 16:07

@AlexanderVanEck Apologies for bumping this thread. Did you make any progress with this? I would be interested in it and would probably include it in a package I'm maintaining called PyTorch Tabular.

manujosephv · May 14 '23 13:05