
[feature-request] Allow opening files on remote filesystems / object stores

Open olivier-lacroix opened this issue 3 years ago • 9 comments

Hello there,

It appears polars does not yet have great support for remote filesystems (e.g. AWS S3 or GCS). It would be great to be able to refer to remote paths directly in IO functions. For instance:

import polars as pl

df = pl.read_csv("gs://my-bucket/my-file.csv")

Currently, one can use fsspec with read_csv, but not with scan_csv, which only accepts a path:

import polars as pl
import fsspec

# This works
with fsspec.open("gs://my-bucket/my-file.csv") as f:
    df = pl.read_csv(f)

# This fails
with fsspec.open("gs://my-bucket/my-file.csv") as f:
    df = pl.scan_csv(f)

olivier-lacroix avatar Jun 13 '21 06:06 olivier-lacroix

Currently scan_csv expects a file because it must infer the file's schema. With such cloud resources we'd have to download the first n rows and then restore the stream as-is (so the whole stream can still be scanned later). I don't see an easy solution for this at the moment.

Do such streams have a seek operation?

ritchie46 avatar Jun 14 '21 06:06 ritchie46

Hi @ritchie46 , makes sense.

As an example, from memory, I believe what Dask does in its lazy read_csv is the following:

  • if the schema is provided as an argument (dtype), do nothing
  • otherwise, read a chunk to infer schema

Yes, I believe (most of) these streams are seekable. In Python, see fsspec for instance, and the seek and seekable methods of the AbstractBufferedFile class.
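As a rough illustration of the "infer, then rewind" idea (the helper name and chunk size are hypothetical, and io.BytesIO stands in for a seekable remote stream):

```python
import csv
import io

def infer_schema_then_rewind(f, n_bytes=8192):
    """Read a prefix of the stream to infer column names, then rewind
    so the full stream can still be scanned later (hypothetical helper)."""
    if not f.seekable():
        raise ValueError("stream must support seek() for lazy scanning")
    head = f.read(n_bytes).decode("utf-8", errors="replace")
    header = next(csv.reader(io.StringIO(head)))  # first CSV row = column names
    f.seek(0)  # restore the stream for the full scan later
    return header

buf = io.BytesIO(b"a,b,c\n1,2,3\n4,5,6\n")
print(infer_schema_then_rewind(buf))  # ['a', 'b', 'c']
print(buf.tell())  # 0 -- the stream is back at the start
```

The same pattern would apply to any fsspec AbstractBufferedFile, since those expose the same seek/seekable interface.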

Using fsspec could help manage this nicely in the python wrapper; I have no idea if there is a nice wrapper like fsspec in rust.

olivier-lacroix avatar Jun 14 '21 07:06 olivier-lacroix

@ritchie46 fsspec supports different read buffering strategies: https://filesystem-spec.readthedocs.io/en/latest/api.html#readbuffering

| Class | Description |
| -- | -- |
| fsspec.caching.ReadAheadCache(blocksize, …) | Cache which reads only when we get beyond a block of data |
| fsspec.caching.BytesCache(blocksize, …[, trim]) | Cache which holds data in an in-memory bytes object |
| fsspec.caching.MMapCache(blocksize, fetcher, …) | Memory-mapped sparse file cache |
| fsspec.caching.BlockCache(blocksize, …[, …]) | Cache holding memory as a set of blocks |

So a BlockCache could be used:

Cache holding memory as a set of blocks.

Requests are only ever made blocksize at a time, and are stored in an LRU cache. The least recently accessed block is discarded when more than maxblocks are stored.

Or, if you want rerunning a read_csv/scan_csv query to be faster, an MMapCache (although that might only work on POSIX).
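To make the LRU-block idea concrete, here is a minimal pure-Python sketch of a BlockCache-style cache. The fetcher signature and eviction details are simplified assumptions for illustration, not fsspec's actual implementation:

```python
from collections import OrderedDict

class SimpleBlockCache:
    """Sketch of fsspec-style block caching: fetch whole blocks via
    `fetcher(start, end) -> bytes` and keep at most `maxblocks` of
    them in an LRU-ordered dict."""

    def __init__(self, blocksize, fetcher, maxblocks=4):
        self.blocksize = blocksize
        self.fetcher = fetcher
        self.maxblocks = maxblocks
        self.blocks = OrderedDict()  # block index -> bytes

    def _block(self, i):
        if i in self.blocks:
            self.blocks.move_to_end(i)  # mark as most recently used
        else:
            start = i * self.blocksize
            self.blocks[i] = self.fetcher(start, start + self.blocksize)
            if len(self.blocks) > self.maxblocks:
                self.blocks.popitem(last=False)  # evict least recently used
        return self.blocks[i]

    def read(self, start, end):
        # Requests are only ever made one block at a time.
        out = b""
        for i in range(start // self.blocksize, (end - 1) // self.blocksize + 1):
            out += self._block(i)
        offset = start % self.blocksize
        return out[offset:offset + (end - start)]

data = bytes(range(256))
cache = SimpleBlockCache(16, lambda s, e: data[s:e], maxblocks=2)
print(cache.read(10, 20) == data[10:20])  # True
```

A real remote fetcher would issue a ranged GET per block, so repeated scans over nearby byte ranges hit the cache instead of the network.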

ghuls avatar Jun 23 '21 15:06 ghuls

PR opened (https://github.com/pola-rs/polars/pull/883) for partial support (only in eager read_* functions that were using the _prepare_file_args function).

Should also be implemented:

  • for the remaining read_* functions
  • for DataFrame.to_* methods
  • for lazy IO functions and methods
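A sketch of what such dispatching might look like on the Python side (the helper name and scheme list here are illustrative, not the actual _prepare_file_args implementation):

```python
from urllib.parse import urlparse

REMOTE_SCHEMES = ("s3", "gs", "gcs", "az", "http", "https")  # assumption

def prepare_file_arg(source):
    """Hypothetical dispatcher: local paths and open file objects pass
    through untouched; remote URLs are opened via fsspec."""
    if isinstance(source, str) and urlparse(source).scheme in REMOTE_SCHEMES:
        import fsspec  # only required when a remote URL is used
        return fsspec.open(source, "rb")
    return source

print(prepare_file_arg("local.csv"))  # 'local.csv'
```

Eager read_* functions could route every source argument through such a helper, which is roughly what the PR above does for the functions already using _prepare_file_args.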

olivier-lacroix avatar Jun 27 '21 04:06 olivier-lacroix

@ritchie46 , as performance takes a significant hit when I/O is done via Python, and in many use cases data is stored in the cloud, it seems that access to cloud storage from Rust is “necessary” for a performant polars. Do you see this as something that is in scope for polars? For arrow-rust / arrow2?

olivier-lacroix avatar Jul 07 '21 07:07 olivier-lacroix

> @ritchie46 , as performance takes a significant hit when I/O is done via Python, and in many use cases data is stored in the cloud, it seems that access to cloud storage from Rust is “necessary” for a performant polars. Do you see this as something that is in scope for polars? For arrow-rust / arrow2?

I think a fast solution is downloading the data to a local temporary file. The local parsers can then utilize mmap and multi-threading to process the file locally.
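A minimal sketch of that approach, using io.BytesIO as a stand-in for the remote object (in practice the stream would come from fsspec or an HTTP client):

```python
import io
import shutil
import tempfile

def materialize_locally(remote_stream, suffix=".csv"):
    """Stream a remote object into a local temporary file so local
    parsers (which can mmap and use multiple threads) can process it.
    The caller is responsible for deleting the file afterwards."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    with tmp:
        shutil.copyfileobj(remote_stream, tmp)
    return tmp.name

path = materialize_locally(io.BytesIO(b"a,b\n1,2\n"))
with open(path, "rb") as f:
    print(f.read())  # b'a,b\n1,2\n'
```

pl.scan_csv(path) could then be pointed at the local copy, trading one up-front download for fully local, mmap-friendly parsing.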

ritchie46 avatar Jul 07 '21 12:07 ritchie46

@ritchie46 How hard would it be to implement opening files from S3? I would like to try implementing this if you can give me some info on how to do it. We are using DataFusion for this in a company project now, but it seems this library could be better in terms of performance.

ozgrakkurt avatar Sep 15 '22 13:09 ozgrakkurt

Maybe something like this would make sense to use here? https://github.com/Noeda/mmapurl

ozgrakkurt avatar Sep 15 '22 16:09 ozgrakkurt

@ozgrakkurt, the object_store crate you mention in https://github.com/pola-rs/polars/issues/4865 seems like a good way to go indeed: it abstracts away the particulars of each object store implementation, the way fsspec does in Python.

olivier-lacroix avatar Sep 15 '22 23:09 olivier-lacroix

This is now supported.

stinodego avatar Feb 17 '24 23:02 stinodego