dataset-viewer
Support vectorial geospatial columns
Requires https://github.com/huggingface/datasets/issues/6438 to support GeoParquet. We could support more formats.
Possibly requires geopandas as a dependency.
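For reference, reading a GeoParquet file with geopandas is a one-liner. A minimal sketch (the file name is a placeholder):

```python
# Minimal sketch: load a GeoParquet file into a GeoDataFrame.
# Requires geopandas (with pyarrow installed); "data.parquet" is hypothetical.
import geopandas as gpd

gdf = gpd.read_parquet("data.parquet")
print(gdf.crs)              # coordinate reference system from the file's geo metadata
print(gdf.geometry.head())  # decoded geometry column
```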
At least, https://github.com/huggingface/datasets-server/issues/2428 will "Read GeoParquet files using parquet reader" (https://github.com/huggingface/datasets/pull/6508).
Thanks @severo for opening this! As I understand it, is an update needed on the server to pull in https://github.com/huggingface/datasets/pull/6508, so that GeoParquet datasets like https://huggingface.co/datasets/joshuasundance/govgis_nov2023-slim-spatial will show up on the Dataset Viewer?
It does :)
Note that we only have the first 100 rows for this dataset, because we ran into two other issues!
- size of the row groups in the geoparquet files:

  ```
  worker.job_runners.config.parquet_and_info.TooBigRowGroupsError: Parquet file has too big row groups. First row group has 950423110 which exceeds the limit of 300000000
  ```

- issue with the features:

  ```
  datasets.table.CastError: Couldn't cast
  id: string
  name: string
  type: string
  description: string
  url: string
  metadata_text: string
  embeddings: list<element: double>
    child 0, element: double
  geometry: binary
  -- schema metadata --
  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1178
  geo: '{"primary_column": "geometry", "columns": {"geometry": {"encoding":' + 1306
  to
  {'id': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'type': Value(dtype='string', id=None), 'description': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'metadata_text': Value(dtype='string', id=None), 'geometry': Value(dtype='binary', id=None)}
  because column names don't match
  ```
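Side note on `geometry: binary` above: GeoParquet stores geometries as WKB-encoded bytes, which is why the column surfaces as opaque binary here. A minimal sketch of decoding such a value with shapely (the sample bytes below are fabricated for illustration):

```python
# Decode a WKB value like the ones stored in the "geometry" column.
from shapely import wkb
from shapely.geometry import Point

raw = Point(1.0, 2.0).wkb   # stand-in for bytes read from the geometry column
print(wkb.loads(raw))       # POINT (1 2)
```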
Awesome, this is a big step forward!
> size of the row groups in the geoparquet files:
To be honest, 950423110 bytes does seem like a bit much for a single row group, but a row group shouldn't be too small either. DuckDB has some nice guidance about this here: https://duckdb.org/docs/guides/performance/file_formats#the-effect-of-row-group-sizes
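For dataset authors hitting this limit, one workaround might be to rewrite the file with smaller row groups. A minimal sketch with pyarrow (file names hypothetical, and 10_000 rows per group is just a guess at a value that lands under the limit):

```python
# Inspect the current row-group sizes, then rewrite with smaller row groups.
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")

table = pq.read_table("data.parquet")  # note: loads the whole file into memory
pq.write_table(table, "data_smaller_groups.parquet", row_group_size=10_000)
```

For files too large to read into memory at once, writing batches incrementally with pq.ParquetWriter would be the alternative.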
> issue with the features:
Hmm, which field is the CastError on? Is it something in the schema metadata? The log seems truncated or something, so I can't quite tell.
Here's the full traceback:
```
Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1169, in compute_config_parquet_and_info_response
    fill_builder_info(builder, hf_endpoint=hf_endpoint, hf_token=hf_token, validate=validate)
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 619, in fill_builder_info
    parquet_files_and_sizes: list[tuple[pq.ParquetFile, int]] = thread_map(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/src/services/worker/.venv/lib/python3.9/site-packages/tqdm/std.py", line 1166, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 556, in retry_and_validate_get_parquet_file_and_size
    validate(pf)
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 588, in validate
    raise TooBigRowGroupsError(
worker.job_runners.config.parquet_and_info.TooBigRowGroupsError: Parquet file has too big row groups. First row group has 950423110 which exceeds the limit of 300000000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 712, in wrapped
    for item in generator(*args, **kwargs):
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/parquet/parquet.py", line 94, in _generate_tables
    yield f"{file_idx}_{batch_idx}", self._cast_table(pa_table)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/parquet/parquet.py", line 74, in _cast_table
    pa_table = table_cast(pa_table, self.info.features.arrow_schema)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2240, in table_cast
    return cast_table_to_schema(table, schema)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2194, in cast_table_to_schema
    raise CastError(
datasets.table.CastError: Couldn't cast
id: string
name: string
type: string
description: string
url: string
metadata_text: string
embeddings: list<element: double>
  child 0, element: double
geometry: binary
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1178
geo: '{"primary_column": "geometry", "columns": {"geometry": {"encoding":' + 1306
to
{'id': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'type': Value(dtype='string', id=None), 'description': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'metadata_text': Value(dtype='string', id=None), 'geometry': Value(dtype='binary', id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_manager.py", line 158, in process
    job_result = self.job_runner.compute()
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1288, in compute
    compute_config_parquet_and_info_response(
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1178, in compute_config_parquet_and_info_response
    parquet_operations, partial = stream_convert_to_parquet(
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 802, in stream_convert_to_parquet
    builder._prepare_split(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
```
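Comparing the two schemas in the CastError, the parquet file has an extra embeddings column that the expected features don't include, which is what trips the cast. A quick, hypothetical way to spot such mismatches (file name is a placeholder; the expected names are taken from the traceback):

```python
# Diff the file's columns against the expected feature names from the traceback.
import pyarrow.parquet as pq

expected = {"id", "name", "type", "description", "url", "metadata_text", "geometry"}
actual = set(pq.read_schema("data.parquet").names)
print("extra in file:", actual - expected)      # {'embeddings'}
print("missing from file:", expected - actual)  # set()
```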
Related (raster, not vectorial: GeoTIFF): https://github.com/huggingface/datasets/issues/6740
geopandas has reached 1.0.0: https://github.com/geopandas/geopandas/releases/tag/v1.0.0