dask-geopandas
ENH: read a list of GIS files into chunks
I have a list of GeoPackages, one per urban area, that I need to read into a dask.GeoDataFrame. Since they are already essentially spatially partitioned, the optimal way would be to read each one directly as a chunk. Right now I have to read them one by one via GeoPandas, concatenate them, and then create a dask.GeoDataFrame from the resulting geopandas.GeoDataFrame, which loses the spatial partitions.
For cases like this, it may be useful to have a dask_geopandas.read_files(list) function that would call geopandas.read_file for each chunk and create a chunked GeoDataFrame directly. It would be helpful to be able to pass either a list or a path to a folder (like we do with parquet), since a list lets you specify a path inside a zip archive, for example (my case).
This is the existing code I am using:
import geopandas as gpd
import pandas as pd
import dask_geopandas

paths = ["foo/bar/one.zip!data/file.gpkg", "foo/bar/two.zip!data/file.gpkg"]
gdfs = []
for file in paths:
    gdf = gpd.read_file(file)
    gdfs.append(gdf)
gdf = pd.concat(gdfs)
ddf = dask_geopandas.from_geopandas(gdf, npartitions=2)  # non-spatial chunks
And this would be optimal:
paths = ["foo/bar/one.zip!data/file.gpkg", "foo/bar/two.zip!data/file.gpkg"]
ddf = dask_geopandas.read_files(paths) # one chunk per file
Actually, looking at the dask API, it should probably be a feature of dask_geopandas.read_file (xref #11), in the same way that dask.dataframe has read_csv.
I suppose we could actually even start with a dask_geopandas.read_file that only supports this use case, as it seems simpler than chunking one file (#11).
In dask, the logic behind read_csv (creating the parts before reading the actual csv) is in read_bytes: https://github.com/dask/dask/blob/714ff4b92df032f50acad205ddbc3b7103eb399f/dask/bytes/core.py#L12. It seems this uses fsspec's get_fs_token_paths to convert a path into a list of paths (https://github.com/intake/filesystem_spec/blob/8ea35ab5179deb57108096b628f9cf09f834d3a2/fsspec/core.py#L534). This could handle converting glob-like strings into a list of paths for us, but I'm not sure whether it would pass through a list of such paths with fiona-style ! layer selection.
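If fsspec turns out not to pass those paths through, one possible workaround is to split off the part after the ! before expanding globs and re-attach it afterwards. The sketch below uses the standard library glob module as a stand-in for fsspec, so it only covers local paths. expand_paths is a hypothetical name, not an existing function in dask, fsspec or dask_geopandas.

```python
import glob


def expand_paths(paths):
    """Hypothetical helper: expand globs but leave the part after '!' untouched."""
    if isinstance(paths, str):
        paths = [paths]
    expanded = []
    for path in paths:
        # Split "archive.zip!inner/file.gpkg" into the archive and the inner path.
        archive, sep, inner = path.partition("!")
        if any(char in archive for char in "*?["):
            # Only the archive part is treated as a glob pattern.
            matches = sorted(glob.glob(archive))
        else:
            # Literal paths pass through unchanged, without touching the disk.
            matches = [archive]
        # Re-attach the inner path (if any) to every match.
        expanded.extend(match + sep + inner for match in matches)
    return expanded
```

With this, a pattern such as "foo/bar/*.zip!data/file.gpkg" would yield one entry per matching archive, each keeping the same inner path.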