[feature-request] Allow opening files on remote filesystems / object stores
Hello there,
It appears polars does not yet have great support for remote filesystems (e.g. AWS S3 or GCS). It would be great to be able to refer easily to remote paths in IO functions, for instance:
```python
import polars as pl

df = pl.read_csv("gs://my-bucket/my-file.csv")
```
Currently, one can use fsspec with `read_csv`, but not with `scan_csv`, which only accepts a path:
```python
import polars as pl
import fsspec

# This works
with fsspec.open("gs://my-bucket/my-file.csv") as f:
    df = pl.read_csv(f)

# This fails
with fsspec.open("gs://my-bucket/my-file.csv") as f:
    df = pl.scan_csv(f)
```
Currently, `scan_csv` expects a file because it must infer the file schema. With such cloud resources we'd have to download the first `n` rows and then restore the stream as-is (for scanning the whole stream later). I don't see an easy solution for this at the moment.
Do such streams have a seek operation?
Hi @ritchie46, makes sense.
As an example, from memory, I believe what Dask does in its lazy `read_csv` is the following (see the sketch after this list):

- if the schema is provided as an argument (`dtype`), do nothing
- otherwise, read a chunk to infer the schema
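A rough sketch of that chunk-based approach from the Python side; the helper name, sample size, and bucket path are all made up for illustration:

```python
import fsspec
import polars as pl

def infer_csv_schema(path: str, n_bytes: int = 64_000):
    """Read only the first chunk of a remote CSV and infer its schema."""
    with fsspec.open(path, mode="rb") as f:
        head = f.read(n_bytes)
    head = head[: head.rfind(b"\n")]  # drop a possibly truncated last row
    return pl.read_csv(head).schema   # polars infers dtypes from the sample

schema = infer_csv_schema("gs://my-bucket/my-file.csv")
```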
Yes, I believe (most of) these streams are seekable. In Python, see fsspec for instance, and the methods `seek` and `seekable` from the `AbstractBufferedFile` class.
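For instance, a quick way to check (the bucket path is a placeholder):

```python
import fsspec

with fsspec.open("gs://my-bucket/my-file.csv", mode="rb") as f:
    print(f.seekable())    # True for AbstractBufferedFile subclasses
    header = f.read(1024)  # read a chunk, e.g. for schema inference
    f.seek(0)              # rewind so the whole stream can be scanned later
```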
Using fsspec could help manage this nicely in the Python wrapper; I have no idea if there is a nice wrapper like fsspec in Rust.
@ritchie46 fsspec supports different read-buffering strategies: https://filesystem-spec.readthedocs.io/en/latest/api.html#readbuffering
| Class | Description |
| -- | -- |
| `fsspec.caching.ReadAheadCache(blocksize, …)` | Cache which reads only when we get beyond a block of data |
| `fsspec.caching.BytesCache(blocksize, …[, trim])` | Cache which holds data in an in-memory bytes object |
| `fsspec.caching.MMapCache(blocksize, fetcher, …)` | Memory-mapped sparse file cache |
| `fsspec.caching.BlockCache(blocksize, …[, …])` | Cache holding memory as a set of blocks |
So a `BlockCache` could be used:

> Cache holding memory as a set of blocks.
> Requests are only ever made `blocksize` at a time, and are stored in an LRU cache. The least recently accessed block is discarded when more than `maxblocks` are stored.
Or, if you want rerunning a `read_csv`/`scan_csv` query to be faster, an `MMapCache` (although that might only work on POSIX systems).
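As a sketch, the buffering strategy can be selected via the `cache_type` argument when opening the remote file; the bucket path and block size below are placeholders:

```python
import fsspec
import polars as pl

# Open the remote file with an LRU block cache: reads fetch at most one
# block_size-sized request at a time, and blocks are kept in memory.
with fsspec.open(
    "gs://my-bucket/my-file.csv",
    mode="rb",
    cache_type="blockcache",
    block_size=2**20,  # 1 MiB per request
) as f:
    df = pl.read_csv(f)
```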
PR opened (https://github.com/pola-rs/polars/pull/883) for partial support (only in eager `read_*` functions that were using the `_prepare_file_args` function).
Should also be implemented:

- for the remaining `read_*` functions
- for `DataFrame.to_*` methods
- for lazy IO functions and methods
@ritchie46, as performance takes a significant hit when IO is done via Python, and in many use cases data is stored in the cloud, it seems having access to cloud storage via Rust is “necessary” for a performant polars. Do you see this as something that is in scope for polars? For arrow-rust / arrow2?
I think a fast solution is downloading the data to a local temporary file. Then the local parsers can utilize `mmap` and multi-threading to process the files locally.
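A minimal sketch of that workaround from the Python side, assuming fsspec handles the remote protocol (the bucket path is a placeholder):

```python
import tempfile

import fsspec
import polars as pl

# Download the remote file once, then let scan_csv mmap the local copy
# and parse it with multiple threads.
with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
    with fsspec.open("gs://my-bucket/my-file.csv", mode="rb") as remote:
        tmp.write(remote.read())
    tmp.flush()
    df = pl.scan_csv(tmp.name).collect()
```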
@ritchie46 How hard would it be to implement opening files from S3? I would like to try implementing this if you can give me some info about how to do it. We are using datafusion for this in a company project now, but it seems like this library could be better in terms of performance.
Maybe something like this would make sense to use here? https://github.com/Noeda/mmapurl
@ozgrakkurt, the object_store crate you mention in https://github.com/pola-rs/polars/issues/4865 seems like a good way to go indeed, abstracting away the particulars of each object store implementation the way fsspec does it in Python.
This is now supported.
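For reference, a hedged sketch of what the original request looks like now that remote paths are supported; the bucket path is a placeholder and credentials are assumed to be resolved from the environment:

```python
import polars as pl

# Eager read directly from an object store
df = pl.read_csv("s3://my-bucket/my-file.csv")

# Lazy scan; recent polars versions resolve cloud paths natively
lf = pl.scan_csv("s3://my-bucket/my-file.csv")
df = lf.collect()
```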